# Data Merging

What if we find more data that we would like to add to our analysis? This notebook quickly works through some of the considerations you'll have to make when merging datasets. When datasets have different data fields, there are two general ways to merge them. With the first, we keep every column we encounter; if a dataset lacks a particular column, we keep the column but fill its rows with NA values. With the second, the one we'll use, we keep only the columns that have values across the entire dataset.
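As a rough illustration of the two methods, here is a minimal sketch on two tiny, made-up dataframes (the names `old`, `new`, and their contents are hypothetical). It uses the `join` argument of `pd.concat`; the notebook itself takes a more manual route and selects the columns by hand.

```python
import pandas as pd

# Two toy frames with partially overlapping columns (hypothetical data).
old = pd.DataFrame({"title": ["a"], "text": ["x"], "subject": ["news"]})
new = pd.DataFrame({"title": ["b"], "text": ["y"], "author": ["someone"]})

# Method 1: keep every column; cells a frame cannot fill become NaN.
keep_all = pd.concat([old, new], join="outer", ignore_index=True, sort=False)

# Method 2: keep only the columns present in both frames.
keep_shared = pd.concat([old, new], join="inner", ignore_index=True)

print(keep_all.columns.tolist())
print(keep_shared.columns.tolist())
```

The first result carries `subject` and `author` with missing values where a frame had no such column; the second keeps only `title` and `text`.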

In [1]:
import re
import pandas as pd

abs_dir = "/Users/williamquinn/Desktop/DH/Python/Teaching/Python-Notebooks/"

### Loading Old Dataframe

In [2]:
data = pd.read_csv(abs_dir + "data/fake-and-real-news-dataset/dataframe.csv",
                   sep = ",")

data.head()

Unnamed: 0,title,text,subject,date,veracity
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,2017-12-31,real
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,2017-12-29,real
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,2017-12-31,real
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,2017-12-30,real
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,2017-12-29,real


### Loading New Dataframe

In [3]:
new_data = pd.read_csv(abs_dir + "data/fake-news-dataset-2/fake.csv",
                       sep = ",")

print(new_data.shape)
new_data.head(2)

(12999, 20)


Unnamed: 0,uuid,ord_in_thread,author,published,title,text,language,crawled,site_url,country,domain_rank,thread_title,spam_score,main_img_url,replies_count,participants_count,likes,comments,shares,type
0,6a175f46bcd24d39b3e962ad0f29936721db70db,0,Barracuda Brigade,2016-10-26T21:41:00.000+03:00,Muslims BUSTED: They Stole Millions In Gov’t B...,Print They should pay all the back all the mon...,english,2016-10-27T01:49:27.168+03:00,100percentfedup.com,US,25689.0,Muslims BUSTED: They Stole Millions In Gov’t B...,0.0,http://bb4sp.com/wp-content/uploads/2016/10/Fu...,0,1,0,0,0,bias
1,2bdc29d12605ef9cf3f09f9875040a7113be5d5b,0,reasoning with facts,2016-10-29T08:47:11.259+03:00,Re: Why Did Attorney General Loretta Lynch Ple...,Why Did Attorney General Loretta Lynch Plead T...,english,2016-10-29T08:47:11.259+03:00,100percentfedup.com,US,25689.0,Re: Why Did Attorney General Loretta Lynch Ple...,0.0,http://bb4sp.com/wp-content/uploads/2016/10/Fu...,0,1,0,0,0,bias


### Making Column Adjustments

We can immediately tell that this new dataset has different columns. While the extra fields may be interesting to include in another analysis, we'll want to keep only the columns that are also in the old dataframe.
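One way to check which columns overlap, rather than comparing the two tables by eye, is to intersect the column indexes. The sketch below uses hand-typed column lists (the new list is only a subset of the twenty columns shown above) so that it runs on its own; with the real dataframes, `data.columns` and `new_data.columns` could stand in for them.

```python
import pandas as pd

# Column names copied from the two tables above (new list abbreviated).
old_cols = pd.Index(["title", "text", "subject", "date", "veracity"])
new_cols = pd.Index(["uuid", "author", "published", "title", "text", "language"])

# Columns both datasets share, and old columns the new dataset lacks.
shared = old_cols.intersection(new_cols)
missing = old_cols.difference(new_cols)

print(sorted(shared))
print(sorted(missing))
```

Here `date` shows up as missing only because the new dataset calls it `published`, which is why the next cell renames that column.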

In [4]:
# We'll first select only the columns to keep.
# And rename the published column name to date.
new_data = new_data[['title', 'text', 'published']] \
    .rename(columns = {'published':'date'})

# The date field holds a full timestamp in a different format. We'll use a regex to keep only year-month-day.
# The \\1 refers to our first group.
# The group (\\1) returns only the text matched within the parentheses.
new_data['date'] = new_data['date'].replace(r'(\d{4}-\d{2}-\d{2}).*', '\\1', regex = True)

# We'll want the date field to be in the same format as the old dataframe, too.
new_data['date'] = pd.to_datetime(new_data['date'], format = '%Y-%m-%d')

# We also know that these articles are fake.
new_data['veracity'] = 'fake'

# While we don't know the "subject" of each entry, we can create our own value.
new_data['subject'] = 'fake_news'

# It could also be useful to keep track of which reports come from which dataset.
data['dataset'] = 0
new_data['dataset'] = 1

new_data.head()

Unnamed: 0,title,text,date,veracity,subject,dataset
0,Muslims BUSTED: They Stole Millions In Gov’t B...,Print They should pay all the back all the mon...,2016-10-26,fake,fake_news,1
1,Re: Why Did Attorney General Loretta Lynch Ple...,Why Did Attorney General Loretta Lynch Plead T...,2016-10-29,fake,fake_news,1
2,BREAKING: Weiner Cooperating With FBI On Hilla...,Red State : \nFox News Sunday reported this mo...,2016-10-31,fake,fake_news,1
3,PIN DROP SPEECH BY FATHER OF DAUGHTER Kidnappe...,Email Kayla Mueller was a prisoner and torture...,2016-11-01,fake,fake_news,1
4,FANTASTIC! TRUMP'S 7 POINT PLAN To Reform Heal...,Email HEALTHCARE REFORM TO MAKE AMERICA GREAT ...,2016-11-01,fake,fake_news,1
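To see the date cleanup in isolation, here is a small sketch that applies the same regex to two timestamps copied from the `published` column above. The variable names (`sample`, `trimmed`, `parsed`) are made up for the example.

```python
import pandas as pd

# Two raw timestamps taken from the "published" column shown earlier.
sample = pd.Series(["2016-10-26T21:41:00.000+03:00",
                    "2016-10-29T08:47:11.259+03:00"])

# Group 1 captures YYYY-MM-DD; ".*" consumes the rest of the string,
# and the replacement r"\1" puts back only the captured group.
trimmed = sample.replace(r"(\d{4}-\d{2}-\d{2}).*", r"\1", regex=True)
print(trimmed.tolist())  # ['2016-10-26', '2016-10-29']

# Convert the trimmed strings to a datetime column.
parsed = pd.to_datetime(trimmed, format="%Y-%m-%d")
print(parsed.dtype)
```

Note that `pd.read_csv` leaves date columns as plain strings unless `parse_dates` is passed, so it may be worth checking `data['date'].dtype` on the old dataframe as well. If it turns out to be `object`, applying `pd.to_datetime` there too would keep the merged column to a single type.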


### Merging Old and New Dataframes

In [5]:
data = pd.concat([data, new_data], sort=False)

# Sometimes the text column contains integers and floats instead of strings.
# In order to run string functions later, we'll want to ensure these values are strings.
data['text'] = data['text'].astype(str)

print(data.shape)
data.head()

(57887, 6)


Unnamed: 0,title,text,subject,date,veracity,dataset
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,2017-12-31,real,0
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,2017-12-29,real,0
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,2017-12-31,real,0
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,2017-12-30,real,0
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,2017-12-29,real,0
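Two side effects of this step are easy to miss, and the toy sketch below tries to make them visible (the frames `a` and `b` are invented for the example). First, `pd.concat` keeps each frame's original row labels unless `ignore_index=True` is passed, so labels can repeat. Second, converting to strings also converts missing values, so filling them first is one possible way to keep empty cells empty.

```python
import pandas as pd

# Toy frames: one text cell is a float, another is missing (hypothetical data).
a = pd.DataFrame({"text": ["first", 3.5]})
b = pd.DataFrame({"text": ["second", None]})

# Default behaviour: row labels from both frames are kept, so they repeat.
stacked = pd.concat([a, b], sort=False)
print(stacked.index.tolist())     # [0, 1, 0, 1]

# ignore_index=True renumbers the rows from 0.
renumbered = pd.concat([a, b], sort=False, ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2, 3]

# Fill missing cells with an empty string before casting everything to str.
as_text = renumbered["text"].fillna("").astype(str)
print(as_text.tolist())
```

Whether repeated labels matter depends on what comes next; positional access such as `.values[0]` is unaffected, while label-based lookups with `.loc` would return more than one row.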


## Save Complete Dataset

We'll now save the entire dataset. Because we're joining data from two different locations (subdirectories) on our computer, I'm going to save the result in a new location, one directory above both of them.

In [6]:
data.to_csv(abs_dir + "data/dataframe.csv",  
            index=False)
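The sketch below shows what `index=False` does and what to expect when the file is read back. It writes to an in-memory buffer instead of a real path, and the small dataframe `df` is made up for the example.

```python
import io
import pandas as pd

# A tiny stand-in for the merged dataframe (hypothetical data).
df = pd.DataFrame({"title": ["a", "b"],
                   "date": pd.to_datetime(["2017-12-31", "2016-10-26"])})

# Write without the row index, so no extra unnamed column is stored.
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Reading back with defaults returns the dates as plain strings.
buffer.seek(0)
plain = pd.read_csv(buffer)

# Passing parse_dates restores a datetime column.
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["date"])

print(plain.columns.tolist())  # ['title', 'date']
print(plain["date"].dtype, parsed["date"].dtype)
```

Dates do not survive a CSV round trip as datetimes on their own, so any later notebook that loads `dataframe.csv` would need `parse_dates` or `pd.to_datetime` if it relies on date arithmetic.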

## Reading Text Fields

Before we finish, though, let's read through some of the articles to look for any other issues we might want to clean. We'll look at two articles from the same day to see how they compare.

### Real News Report

In [11]:
data.query('(veracity == "real") & (date == "2017-10-05")')['text'].values[0]

'WASHINGTON (Reuters) - U.S. Republican Representative Tim Murphy will resign from Congress on Oct. 21, House Speaker Paul Ryan said on Thursday, following a report alleging that Murphy had asked a woman with whom he was having an affair to get an abortion. Murphy had said in a statement on Wednesday he would not seek re-election next year. The lawmaker had been a member of the Congressional Pro-Life Caucus, once receiving a 92 percent score from the conservative Family Research Council, which opposes abortion. There was no immediate response from Murphy’s office for request for comment on Thursday. The Pittsburgh Post-Gazette, citing a Jan. 25 text message, said the woman had chastised Murphy for asking her to get an abortion during a pregnancy scare despite his office posting an anti-abortion statement on Facebook. According to the newspaper, Murphy texted her in response: “I get what you say about my March for life messages. I’ve never written them. Staff does them. I read them and 

### Fake News Report

In [12]:
data.query('(veracity == "fake") & (date == "2017-09-09")')['text'].values[0]

'Two Florida Republican lawmakers voted against a $15 billion hurricane relief bill just as Irma churned toward the state. Congressmen Matt Gaetz (R-FL) and Ted Yoho (R-FL) claim they have concerns about other provisions of the measure. Irma is now a Category 3 storm with maximum sustained winds of 125 mph but forecasters expect the storm to strengthen. Presently, the Florida Keys is facing a potentially catastrophic force that could threaten to drown entire islands.But, it s obviously more important for Gaetz and Yoho to exercise their conservative bonafides.The relief package sailed through the Senate and the House and was signed by Donald Trump on Friday. The package boosts funding for the Federal Emergency Management Agency (FEMA), a necessary move at this time following the destruction left by Hurricane Harvey and much more expected after Florida feels Irma s force and fury. Also, FEMA is running out of money so the funds are needed.The package will also raise the debt ceiling for
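Reading the two samples side by side suggests a couple of issues in the fake report: sentences run together without a space ("islands.But") and apostrophes appear to have been dropped ("it s"). As a rough, assumption-laden sketch, the patterns below flag both cases in a snippet copied from the output above; they are simple heuristics and would likely need tuning before being applied to the full `text` column.

```python
import re

# A short excerpt copied from the fake news report above.
snippet = ("could threaten to drown entire islands.But, it s obviously more "
           "important for Gaetz and Yoho")

# A lowercase letter, a period, then an uppercase letter with no space between.
fused_sentences = re.findall(r"[a-z]\.[A-Z]", snippet)

# A word followed by a lone "s", which suggests a dropped apostrophe.
lost_apostrophes = re.findall(r"\b\w+ s\b", snippet)

print(fused_sentences)   # ['s.B']
print(lost_apostrophes)  # ['it s']
```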