
In [6]:
import pandas as pd
import os
import glob
import numpy as np

# Fake news term usage analysis

For this task we consider three different periods of time in order to see when the term "fake news" arose: the campaign period, the president-elect period and the presidency period. First we create a dataframe for each of these periods. Below are the helper functions that we use:

In [7]:
def load_data():
    """
    Loads every json file of the dataset and returns a dictionary and two lists:
    all_data (file name -> dataframe), condensed and master.
    A dataframe is accessed in the dictionary
    using the file name without the .json extension.
    E.g.: all_data["condensed_2009"]
    """

    trump_tweets = glob.glob("trump_tweets/*.json")

    all_data = {}
    condensed = []
    master = []

    for json_file in trump_tweets:

        file = pd.read_json(json_file)
        all_data[os.path.basename(json_file).replace(".json", "")] = file

        if "master" in os.path.basename(json_file):
            master.append(file)
        else:
            condensed.append(file)

    return all_data, condensed, master



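def select_time_interval(df, date_column, start_datetime, end_datetime):
    """
    Returns the rows of df whose date_column lies between start_datetime
    and end_datetime (both bounds inclusive).
    A copy is returned so that new columns can be added to the result
    without triggering pandas' SettingWithCopyWarning.
    """
    in_interval = (df[date_column] >= start_datetime) & (df[date_column] <= end_datetime)
    return df[in_interval].copy()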
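
A small check on toy data makes the boundary behaviour of the interval selection concrete. The sketch below uses a self-contained variant of the helper and a made-up three-row dataframe; the `toy` frame and its values are assumptions for illustration only, not data from the tweet archive. Because `np.datetime64('2016-11-08')` denotes midnight at the start of that day, a tweet posted later on 8 November falls outside an interval that ends on that value.

```python
import numpy as np
import pandas as pd


def select_time_interval(df, date_column, start_datetime, end_datetime):
    """Return the rows whose date lies in [start_datetime, end_datetime]."""
    in_interval = (df[date_column] >= start_datetime) & (df[date_column] <= end_datetime)
    return df[in_interval].copy()


# hypothetical toy data, not taken from the real dataset
toy = pd.DataFrame({
    "created_at": pd.to_datetime([
        "2016-11-07 23:59:00",   # the evening before the end date
        "2016-11-08 00:00:00",   # exactly on the end bound
        "2016-11-08 15:30:00",   # later on the end date
    ]),
    "text": ["before", "midnight", "afternoon"],
})

selected = select_time_interval(
    toy, "created_at", np.datetime64("2016-02-01"), np.datetime64("2016-11-08")
)

# the afternoon tweet is dropped because the end bound is 2016-11-08 00:00:00
print(selected["text"].tolist())
```

If a whole end day should be included, one option is to pass the following day as the end bound and switch the second comparison to a strict `<`.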

- Here we create the three dataframes:

In [8]:
# retrieving all data from Trump's tweets dataset
all_data, condensed, master = load_data()

# getting the condensed version for the years 2016 and 2017
condensed_2016 = all_data["condensed_2016"]
condensed_2017 = all_data["condensed_2017"]

# creating a dataframe for campaign period
cond_US_campaign_2016 = select_time_interval(condensed_2016, 'created_at',
                                             np.datetime64('2016-02-01'), np.datetime64('2016-11-08'))

cond_US_campaign_2016 = cond_US_campaign_2016.sort_values('created_at')


# creating a dataframe for president elect period
cond_pres_elect_df = select_time_interval(condensed_2016, 'created_at',
                                          np.datetime64('2016-11-09'), np.datetime64('2016-12-31'))

cond_pres_elect_df_2017 = select_time_interval(condensed_2017, 'created_at',
                                               np.datetime64('2017-01-01'), np.datetime64('2017-01-20'))

# DataFrame.append was removed in pandas 2.0, so the two parts are stacked with pd.concat
cond_pres_elect_df = pd.concat([cond_pres_elect_df, cond_pres_elect_df_2017])

cond_pres_elect_df = cond_pres_elect_df.sort_values('created_at')



# creating a dataframe for presidency period
cond_president_period_df = select_time_interval(condensed_2017, 'created_at',
                                                np.datetime64('2017-01-20'), np.datetime64('2017-11-05'))

cond_president_period_df = cond_president_period_df.sort_values('created_at')
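
The president-elect period spans two yearly files, so its two slices have to be stacked into one dataframe before sorting. The following is a minimal sketch of that step on made-up rows; `part_2016` and `part_2017` are hypothetical stand-ins for the two slices. It uses `pd.concat`, since `DataFrame.append` is no longer available in pandas 2.0 and later.

```python
import pandas as pd

# hypothetical stand-ins for the 2016 and 2017 slices of the period
part_2016 = pd.DataFrame({
    "created_at": pd.to_datetime(["2016-12-10", "2016-11-15"]),
    "text": ["second", "first"],
})
part_2017 = pd.DataFrame({
    "created_at": pd.to_datetime(["2017-01-11"]),
    "text": ["third"],
})

# stack the two parts, then order the rows chronologically
combined = pd.concat([part_2016, part_2017], ignore_index=True)
combined = combined.sort_values("created_at")

print(combined["text"].tolist())
```

Here `ignore_index=True` renumbers the rows so that index labels from the two files cannot collide. Omitting it keeps the original per-file index labels.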

## The campaign period

- Now we search each tweet for the term "fake news". We use a simple case-insensitive regex that matches both "fake news" and "fakenews". With the 'contains' method we create a new boolean column 'fake_news_used' in our dataframe. After this we add columns with the month, week/year and day values so that we can group by them and look for patterns in the usage of the term.

In [9]:
# creating the column with boolean values for the matches of the regex:
cond_US_campaign_2016['fake_news_used'] = cond_US_campaign_2016['text'].str.contains('fake news|fakenews', case=False)

# creating the columns 'Month', 'week/year' and 'Date'
cond_US_campaign_2016['Month'] = cond_US_campaign_2016['created_at'].dt.month
cond_US_campaign_2016['week/year'] = cond_US_campaign_2016['created_at'].apply(lambda x: "%d/%d" % (x.week, x.year))
cond_US_campaign_2016['Date'] = cond_US_campaign_2016['created_at'].dt.date
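
To illustrate how the case-insensitive match behaves, here is a small sketch on made-up strings; the `tweets` series is an assumption for illustration. It uses the same pattern as above and additionally passes `na=False`, which is not in the original cell, so that a missing text counts as "no match" instead of producing a missing value.

```python
import pandas as pd

# hypothetical example texts, including one missing value
tweets = pd.Series([
    "FAKE NEWS - A TOTAL POLITICAL WITCH HUNT!",
    "I have never been to Prague in my life. #fakenews",
    "Cruz is a fake Texan!",
    "Great news conference today",
    None,
])

# same alternation as in the notebook; case=False ignores upper/lower case
flags = tweets.str.contains("fake news|fakenews", case=False, na=False)

print(flags.tolist())
print(int(flags.sum()))  # number of matching tweets
```

Only the first two strings match: "fake" and "news" on their own are not enough, which is why the separate single-word searches are a useful cross-check.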


- First we count how many positive matches we have:

In [10]:
cond_US_campaign_2016['fake_news_used'].sum()

0

As you can see, during his campaign period the term 'fake news' did not appear in any of his tweets. To be sure, we also search for the words 'fake' and 'news' separately. These are the results:

In [48]:
# search for the word fake
match_df = cond_US_campaign_2016.loc[:,['text']]
match_df['fake_usage'] = cond_US_campaign_2016['text'].str.contains('fake', case=False)
temp_df = match_df[match_df['fake_usage'] == True]

print(str(temp_df.text.values))
temp_df


[ '"@ddpick18: @realDonaldTrump This Texan will be voting Trump March 1st. Cruz is a fake Texan!"'
 '@elizabethforma Goofy Elizabeth Warren, sometimes referred to as Pocahontas because she faked the fact she is native American, is a lowlife!'
 '"@JimVitari:  @ABC @washingtonpost we know they\'re fake just like poles during primary. I\'m sure u will crush #CrookedHillary in general"'
 '"@brazosboys: Hillary read "sigh" off the Teleprompter, She\'s so fake she has to be told how to feel: https://t.co/ENXliW2m77 @FoxNews']


Unnamed: 0,text,fake_usage
3349,"""@ddpick18: @realDonaldTrump This Texan will b...",True
2228,"@elizabethforma Goofy Elizabeth Warren, someti...",True
1898,"""@JimVitari: @ABC @washingtonpost we know the...",True
1897,"""@brazosboys: Hillary read ""sigh"" off the Tele...",True


In [49]:
# search for the word 'news'
match_df = cond_US_campaign_2016.loc[:,['text']]
match_df['news_usage'] = cond_US_campaign_2016['text'].str.contains('news', case=False)
temp_df = match_df[match_df['news_usage'] == True]

# showing the results
temp_df.head()

Unnamed: 0,text,news_usage
3705,I will be interviewed on @greta at 7:00 P.M. E...,True
3601,"Dopey Mort Zuckerman, owner of the worthless @...",True
3600,"Worthless @NYDailyNews, which dopey Mort Zucke...",True
3599,"Like the worthless @NYDailyNews, looks like @p...",True
3591,There are no buyers for the worthless @NYDaily...,True


In [51]:
# number of positive matches:
temp_df.news_usage.sum()

147

For the 'news' term we found 147 matches, mostly tweets that mention media accounts (for example @NYDailyNews) or other tweets that relate to the 'fake news' theme.
Therefore, we confirm what we said before: in the campaign period there is no sign of the term 'fake news' in his tweets. We can go ahead with the president-elect period.

## President-elect period

- Here we repeat the same process that we have done for the campaign period:

In [37]:
# creating the column with boolean values for the matches of the regex:
cond_pres_elect_df['fake_news_used'] = cond_pres_elect_df['text'].str.contains('fake news|fakenews', case=False)

# creating the columns 'month', 'week/year' and 'date'
cond_pres_elect_df['month'] = cond_pres_elect_df['created_at'].dt.month
cond_pres_elect_df['week/year'] = cond_pres_elect_df['created_at'].apply(lambda x: "%d/%d" % (x.week, x.year))
cond_pres_elect_df['date'] = cond_pres_elect_df['created_at'].dt.date

# showing the number of positive matches:
cond_pres_elect_df['fake_news_used'].sum()

11

- We have only a few positive matches, so we continue with the analysis.
- First we display all the matching tweets:

In [41]:
# selecting the date and text of the tweets that match the regex
match_df = cond_pres_elect_df.loc[cond_pres_elect_df['fake_news_used'] == True, ['date','text']]
match_df = match_df.sort_values('date')
print(str(match_df.text.values))
match_df


[ 'Reports by @CNN that I will be working on The Apprentice during my Presidency, even part time, are ridiculous &amp; untrue - FAKE NEWS!'
 'FAKE NEWS - A TOTAL POLITICAL WITCH HUNT!'
 'RT @MichaelCohen212: I have never been to Prague in my life. #fakenews https://t.co/CMil9Rha3D'
 "'BuzzFeed Runs Unverifiable Trump-Russia Claims' #FakeNews \nhttps://t.co/d6daCFZHNh"
 'I win an election easily, a great "movement" is verified, and crooked opponents try to belittle our victory with FAKE NEWS. A sorry state!'
 'Intelligence agencies should never have allowed this fake news to "leak" into the public. One last shot at me.Are we living in Nazi Germany?'
 "We had a great News Conference at Trump Tower today. A couple of FAKE NEWS organizations were there but the people truly get what's going on"
 '.@CNN is in a total meltdown with their FAKE NEWS because their ratings are tanking since election and their credibility will soon be gone!'
 'Totally made up facts by sleazebag political operative

Unnamed: 0,date,text
91,2016-12-10,Reports by @CNN that I will be working on The ...
2123,2017-01-11,FAKE NEWS - A TOTAL POLITICAL WITCH HUNT!
2122,2017-01-11,RT @MichaelCohen212: I have never been to Prag...
2121,2017-01-11,'BuzzFeed Runs Unverifiable Trump-Russia Claim...
2118,2017-01-11,"I win an election easily, a great ""movement"" i..."
2117,2017-01-11,Intelligence agencies should never have allowe...
2116,2017-01-12,We had a great News Conference at Trump Tower ...
2112,2017-01-12,.@CNN is in a total meltdown with their FAKE N...
2108,2017-01-13,Totally made up facts by sleazebag political o...
2090,2017-01-16,"much worse - just look at Syria (red line), Cr..."


- We noticed that in these 11 tweets the term 'fake news' was mostly related to CNN and Russia. In particular, the rise of the term followed BuzzFeed's release of an unverified document containing strong claims about ties between Trump and Russia and about the possibility that Trump could be blackmailed by the Russian government. He overreacted to this leak, attacking the intelligence agencies and trying to discredit the media. For more information, the story is described by the New York Times: https://www.nytimes.com/2017/01/10/business/buzzfeed-donald-trump-russia.html


- Due to the small number of tweets with the 'fake news' term we can go ahead with the next period.

## Presidency period

Again, we prepare the dataframe creating the column for the match:

In [42]:
# creating the column with boolean values from the regex result:
cond_president_period_df['fake_news_used'] = cond_president_period_df['text'].str.contains('fake news|fakenews', case=False)

# showing the number of positive results:
cond_president_period_df['fake_news_used'].sum()


132

- Considering that this period covers roughly 10 months, 132 positive matches is a substantial number, so we can start a deeper analysis by month, by week/year and by date:

In [43]:
# adding month, date and week/year columns in order to make groupby operations
cond_president_period_df['Month'] = cond_president_period_df['created_at'].dt.month
cond_president_period_df['date'] = cond_president_period_df['created_at'].dt.date
cond_president_period_df['week/year'] = cond_president_period_df['created_at'].apply(lambda x: "%d/%d" % (x.week, x.year))
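
As a hedged illustration of the kind of groupby these helper columns allow, the sketch below counts matches per month on a made-up dataframe. The `toy` frame and its values are assumptions for illustration only; they are not results from the dataset and do not stand in for the analyses that follow.

```python
import pandas as pd

# hypothetical toy data, not taken from the real tweets
toy = pd.DataFrame({
    "created_at": pd.to_datetime(
        ["2017-02-01", "2017-02-15", "2017-03-03", "2017-03-20"]
    ),
    "fake_news_used": [True, False, True, True],
})

# same helper column as in the cell above
toy["Month"] = toy["created_at"].dt.month

# summing a boolean column counts the True values in each group
matches_per_month = toy.groupby("Month")["fake_news_used"].sum()

# total number of tweets per month, to put the counts in proportion
tweets_per_month = toy.groupby("Month").size()
share_per_month = matches_per_month / tweets_per_month

print(matches_per_month)
print(share_per_month)
```

Grouping by 'week/year' or 'date' instead of 'Month' follows the same pattern. Reporting the share next to the raw count helps when the number of tweets differs a lot between groups.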


### analysis by month

In [44]:
## still have to add the analysis

### analysis by week/year

In [None]:
## still have to add the analysis

### analysis by date

In [None]:
## still have to add the analysis

# Washington Post dataset scraping

In [None]:
## still have to add the analysis

### Merge with our dataset

In [45]:
## still have to add the analysis

## Analysis of source usage (Android vs. iPhone)

In [None]:
## still have to add the analysis