**LET'S BEGIN THE CLEANING - Donald Trump Tweets**

    1. Import the needed libraries
    2. Import CSV of Donald Trump tweets and create a dataframe df_trump
    3. Initial Analysis of the df

In [1]:
import pandas as pd
import numpy as np
import sidetable #bonus
import datetime

In [2]:
df_trump = pd.read_csv('../data/realdonaldtrump.csv')
df_trump.head(2)

Unnamed: 0,id,link,content,date,retweets,favorites,mentions,hashtags
0,1698308935,https://twitter.com/realDonaldTrump/status/169...,Be sure to tune in and watch Donald Trump on L...,2009-05-04 13:54:25,510,917,,
1,1701461182,https://twitter.com/realDonaldTrump/status/170...,Donald Trump will be appearing on The View tom...,2009-05-04 20:00:10,34,267,,


In [3]:
df_trump.shape
#all looking good

(43352, 8)

In [4]:
df_trump.dtypes

# types look fine except 'date': it should be datetime, not a string (object)

id            int64
link         object
content      object
date         object
retweets      int64
favorites     int64
mentions     object
hashtags     object
dtype: object

**Note_1:** change the "date" column type to datetime

In [5]:
# the format must match the raw strings, e.g. '2009-05-04 13:54:25'
df_trump['date'] = pd.to_datetime(df_trump['date'], format='%Y-%m-%d %H:%M:%S')

In [6]:
df_trump.dtypes

id                    int64
link                 object
content              object
date         datetime64[ns]
retweets              int64
favorites             int64
mentions             object
hashtags             object
dtype: object

**Note_1:** Solved. Date column changed to datetime.
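
A quick way to sanity-check a format string before running it on the whole column is to try it on a couple of sample values. This is only a sketch: the sample strings copy the layout printed by `head()`, and the `messy` series with a bad value is made up for illustration.

```python
import pandas as pd

# Two sample timestamps in the same layout as the 'date' column.
sample = pd.Series(["2009-05-04 13:54:25", "2009-05-04 20:00:10"])

# Dashes between year, month and day, a space, then colons in the time part.
parsed = pd.to_datetime(sample, format="%Y-%m-%d %H:%M:%S")

# errors="coerce" turns values that do not fit the format into NaT
# instead of raising, which makes it easy to count bad rows.
messy = pd.Series(["2009-05-04 13:54:25", "not a date"])  # hypothetical data
coerced = pd.to_datetime(messy, format="%Y-%m-%d %H:%M:%S", errors="coerce")
n_bad = int(coerced.isna().sum())
```

If `n_bad` is greater than zero on the real column, the rows with `NaT` are the ones worth inspecting by hand.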

In [7]:
df_trump.columns

Index(['id', 'link', 'content', 'date', 'retweets', 'favorites', 'mentions',
       'hashtags'],
      dtype='object')

**Note_2:** column names are consistent :) and the 'id' column looks like a unique identifier, so it can replace the default index:

    1. First, check 'id' for duplicates to confirm it can serve as the index
    2. Set 'id' as the index

In [8]:
df_trump.duplicated(subset = ['id']).sum()

0

In [9]:
df_trump.set_index(['id'], inplace = True)
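
The duplicate check and the index assignment can also be combined. A minimal sketch on a toy frame (the ids below are made up): `verify_integrity=True` asks `set_index` to raise a `ValueError` if the new index contains duplicates, so the check cannot be skipped by accident.

```python
import pandas as pd

# Toy frame standing in for df_trump; ids and contents are hypothetical.
toy = pd.DataFrame({"id": [101, 102, 102], "content": ["a", "b", "c"]})

# Same check as in the notebook: count rows whose id already appeared.
n_dupes = int(toy.duplicated(subset=["id"]).sum())

# With a duplicated id, set_index refuses to build the index.
try:
    toy.set_index("id", verify_integrity=True)
    raised = False
except ValueError:
    raised = True

# After dropping the repeated id, the same call succeeds.
clean = toy.drop_duplicates(subset=["id"]).set_index("id", verify_integrity=True)
```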

In [10]:
df_trump.head(2)

id,link,content,date,retweets,favorites,mentions,hashtags
1698308935,https://twitter.com/realDonaldTrump/status/169...,Be sure to tune in and watch Donald Trump on L...,2009-05-04 13:54:25,510,917,,
1701461182,https://twitter.com/realDonaldTrump/status/170...,Donald Trump will be appearing on The View tom...,2009-05-04 20:00:10,34,267,,


In [11]:
df_trump.isnull().sum()

link             0
content          0
date             0
retweets         0
favorites        0
mentions     22966
hashtags     37769
dtype: int64

**Note_3:** 'mentions', 'link' and 'hashtags' are not relevant for my analysis, so I will drop them. 'mentions' and 'hashtags' are also the only columns with missing values; the other columns are good to go.
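
Raw null counts are easier to judge as a share of all rows. A small sketch on a made-up frame: `isnull().mean()` gives the fraction of missing values per column, and `drop(columns=...)` returns a trimmed copy without touching the original.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with the same kind of gaps as 'mentions' and 'hashtags'.
toy = pd.DataFrame({
    "content": ["a", "b", "c", "d"],
    "mentions": ["@x", np.nan, np.nan, "@y"],
    "hashtags": [np.nan, np.nan, np.nan, "#z"],
})

# Percentage of missing values per column.
missing_pct = toy.isnull().mean() * 100

# Non-mutating alternative to drop(..., axis=1, inplace=True).
trimmed = toy.drop(columns=["mentions", "hashtags"])
```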

In [12]:
df_trump.drop(['mentions', 'hashtags', 'link'], axis=1, inplace = True)

In [13]:
df_trump.head(2)

id,content,date,retweets,favorites
1698308935,Be sure to tune in and watch Donald Trump on L...,2009-05-04 13:54:25,510,917
1701461182,Donald Trump will be appearing on The View tom...,2009-05-04 20:00:10,34,267


**Note_4:** WAIT A MINUTE!! I can't work with the date like this; I need to split it.
I will create one column for each of the variables I want:
- Year
- Month
- Day
- Hour
- Minute

Why, you might ask? I will analyze the tweets by market hours, so I need the day and time of each tweet in order to filter out the moments when the market was closed. Seconds are not relevant! Let's go!
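
To make the idea concrete, here is a rough sketch of what a "market open" filter could look like. Everything about the session is an assumption: the notebook does not say which market is meant, so Monday to Friday, 09:30 to 16:00 is used as a placeholder, holidays are ignored, and the timestamps are assumed to already be in the market's time zone.

```python
import pandas as pd

# Made-up timestamps for illustration only.
ts = pd.Series(pd.to_datetime([
    "2009-05-04 13:54:25",  # Monday, inside the assumed session
    "2009-05-04 20:00:10",  # Monday, after the assumed close
    "2009-05-09 10:15:00",  # Saturday
]))

# Minutes since midnight make the time-of-day comparison simple.
minutes = ts.dt.hour * 60 + ts.dt.minute

is_weekday = ts.dt.dayofweek < 5                              # Monday=0 ... Sunday=6
in_session = (minutes >= 9 * 60 + 30) & (minutes < 16 * 60)   # assumed hours
market_open = is_weekday & in_session
```

Note that the weekday comes from the full timestamp, so it is easiest to compute while the 'date' column still exists.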

In [14]:
# use the .dt accessor to create the new columns listed above
# the attribute at the end of each line (.year, .month, ...) selects the component

df_trump['year'] = df_trump['date'].dt.year
df_trump['month'] = df_trump['date'].dt.month
df_trump['day'] = df_trump['date'].dt.day
df_trump['hour'] = df_trump['date'].dt.hour
df_trump['minute'] = df_trump['date'].dt.minute

In [15]:
df_trump.head(2)

id,content,date,retweets,favorites,year,month,day,hour,minute
1698308935,Be sure to tune in and watch Donald Trump on L...,2009-05-04 13:54:25,510,917,2009,5,4,13,54
1701461182,Donald Trump will be appearing on The View tom...,2009-05-04 20:00:10,34,267,2009,5,4,20,0


**Note_4.1:** This looks great, but I don't need the 'date' column anymore, so I will drop it.

In [16]:
df_trump.drop(['date'], axis=1, inplace = True)

In [17]:
df_trump.head(2)

id,content,retweets,favorites,year,month,day,hour,minute
1698308935,Be sure to tune in and watch Donald Trump on L...,510,917,2009,5,4,13,54
1701461182,Donald Trump will be appearing on The View tom...,34,267,2009,5,4,20,0
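
Dropping 'date' does not lose the information for good. As a sketch, `pd.to_datetime` can assemble a timestamp again from columns named `year`, `month`, `day`, `hour` and `minute`; only the seconds are gone. The two rows below reuse the numeric values printed above.

```python
import pandas as pd

# Component values taken from the two rows shown by head().
parts = pd.DataFrame({
    "year": [2009, 2009], "month": [5, 5], "day": [4, 4],
    "hour": [13, 20], "minute": [54, 0],
})

# Passing a DataFrame with these column names rebuilds a datetime Series.
rebuilt = pd.to_datetime(parts[["year", "month", "day", "hour", "minute"]])

# With a real timestamp back, weekday names are one call away.
weekday = rebuilt.dt.day_name()
```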


**Note_4.2:** I love how the date looks now that it is split, but I want to check the column types!!

In [18]:
df_trump.dtypes

content      object
retweets      int64
favorites     int64
year          int64
month         int64
day           int64
hour          int64
minute        int64
dtype: object

In [19]:
df_trump.isnull().sum()

content      0
retweets     0
favorites    0
year         0
month        0
day          0
hour         0
minute       0
dtype: int64

**AMAZING** This is exactly what I wanted to see and use.

**Let's save the file**

In [20]:
df_trump.to_csv('../data/trumpnew.csv')
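
Because 'id' is the index, `to_csv` writes it as the first column of the file. A small round-trip sketch, using an in-memory buffer instead of the real path and placeholder tweet texts, shows how to read it back so that 'id' becomes the index again:

```python
import io
import pandas as pd

# Toy frame with the same layout as the cleaned data; contents are placeholders.
toy = pd.DataFrame(
    {"content": ["tweet one", "tweet two"],
     "retweets": [510, 34], "favorites": [917, 267],
     "year": [2009, 2009], "month": [5, 5], "day": [4, 4],
     "hour": [13, 20], "minute": [54, 0]},
    index=pd.Index([1698308935, 1701461182], name="id"),
)

# Write to a buffer rather than '../data/trumpnew.csv'.
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# index_col='id' restores the index; without it 'id' comes back as a plain column.
restored = pd.read_csv(buffer, index_col="id")
```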

**SOME BONUS CONCLUSIONS**

    Note: these are not relevant for the overall analysis.

In [21]:
df_trump.stb.freq(['month','year'])

Unnamed: 0,month,year,count,percent,cumulative_count,cumulative_percent
0,1,2015,1145,2.641170,1145,2.641170
1,2,2013,926,2.136003,2071,4.777173
2,4,2013,912,2.103709,2983,6.880882
3,3,2013,871,2.009135,3854,8.890017
4,1,2013,814,1.877653,4668,10.767669
...,...,...,...,...,...,...
129,2,2010,4,0.009227,43340,99.972320
130,1,2010,4,0.009227,43344,99.981546
131,11,2009,3,0.006920,43347,99.988467
132,9,2009,3,0.006920,43350,99.995387


In [22]:
df_trump.stb.freq(['year','month'])

Unnamed: 0,year,month,count,percent,cumulative_count,cumulative_percent
0,2015,1,1145,2.641170,1145,2.641170
1,2013,2,926,2.136003,2071,4.777173
2,2013,4,912,2.103709,2983,6.880882
3,2013,3,871,2.009135,3854,8.890017
4,2013,1,814,1.877653,4668,10.767669
...,...,...,...,...,...,...
129,2010,1,4,0.009227,43340,99.972320
130,2009,10,4,0.009227,43344,99.981546
131,2009,11,3,0.006920,43347,99.988467
132,2009,9,3,0.006920,43350,99.995387
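
For readers without sidetable installed, a frequency table with the same column names as the output above can be approximated in plain pandas. This is a rough equivalent written for illustration on made-up data, not sidetable's actual implementation.

```python
import pandas as pd

# Hypothetical year/month pairs.
toy = pd.DataFrame({
    "year":  [2013, 2013, 2013, 2015, 2015, 2009],
    "month": [2, 2, 2, 1, 1, 9],
})

# Count each (year, month) pair, largest first.
freq = (
    toy.groupby(["year", "month"])
       .size()
       .sort_values(ascending=False)
       .reset_index(name="count")
)

# Share of all rows, plus running totals.
freq["percent"] = freq["count"] / freq["count"].sum() * 100
freq["cumulative_count"] = freq["count"].cumsum()
freq["cumulative_percent"] = freq["percent"].cumsum()
```

Swapping the order of the grouping columns changes only the column order of the result, not the counts, which matches what the two `stb.freq` calls above show.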
