# Machine Learning Project

## Introduction

After taking the Data Science course from BrainStation, I learned how powerful Python and its libraries are for data analysis, prediction and visualization. To strengthen my skills in algorithm development, I decided to go a step further and dive deeper into Machine Learning.

I am currently in the 3rd week of the course and planning my project. It took me a while to decide on a topic, as I wanted to avoid repeating the skills I used in the Data Science course. I learned that Data Science and Machine Learning involve several stages:
 - Data collection
 - Data sorting
 - Data analysis
 - Algorithm development
 - Apply and optimize algorithm
 - Utilize the results generated by the algorithm to provide insights and further conclusions

To apply all of the above skills and make more extensive use of machine learning libraries, including scikit-learn, TensorFlow and Keras, I researched suggested projects, read feedback on them, and decided to analyze real and fake news. The data was taken from Kaggle and the project is in progress. I will keep uploading the notebook as it is updated, so stay tuned!

## Data Collection

**Data Source:** Data is collected from Kaggle https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset

There are 2 sets of data - **True.csv** and **Fake.csv**, which contain 21,417 and 23,481 rows, respectively (roughly 21,000 and 17,500 unique article texts). Each dataset contains the following information:

 - Article Title
 - Article Text
 - Article Subject
 - Article published date

The data was cleaned and uploaded to Kaggle by its creator before I downloaded it for analysis.

## Import Libraries and Data

In [2]:
# import libraries
# more libraries will be imported as I go

# libraries for data processing and data analysis
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="darkgrid")

%matplotlib inline

# visual representation of text data
from wordcloud import WordCloud

# libraries for machine learning


# Training and test splits prior to data modelling
from sklearn.model_selection import train_test_split

# Various models for data modelling
# (real vs. fake news is a classification task, so a classifier is imported here)
from sklearn.linear_model import LogisticRegression

# Model Scoring
from sklearn.metrics import accuracy_score
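
To show how the split, model and scoring imports fit together, here is a minimal sketch on made-up numeric data. It is not the project's actual pipeline: the feature matrix `X`, the labels `y`, and the use of `LogisticRegression` as a stand-in classifier are all assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Made-up data: one feature, class 1 whenever the feature is positive.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = (X[:, 0] > 0).astype(int)

# Hold out 25% of the rows; stratify keeps the class balance similar in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Fit on the training split only, then score on the unseen test split.
model = LogisticRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
score = accuracy_score(y_test, y_pred)
print(f"test accuracy: {score:.2f}")
```

Scoring on the held-out split rather than the training split gives a fairer estimate of how the model might behave on articles it has not seen.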

In [40]:
# import data
true = pd.read_csv('data/True.csv')
true.tail()

| | title | text | subject | date |
|---|---|---|---|---|
| 21412 | 'Fully committed' NATO backs new U.S. approach... | BRUSSELS (Reuters) - NATO allies on Tuesday we... | worldnews | August 22, 2017 |
| 21413 | LexisNexis withdrew two products from Chinese ... | LONDON (Reuters) - LexisNexis, a provider of l... | worldnews | August 22, 2017 |
| 21414 | Minsk cultural hub becomes haven from authorities | MINSK (Reuters) - In the shadow of disused Sov... | worldnews | August 22, 2017 |
| 21415 | Vatican upbeat on possibility of Pope Francis ... | MOSCOW (Reuters) - Vatican Secretary of State ... | worldnews | August 22, 2017 |
| 21416 | Indonesia to buy $1.14 billion worth of Russia... | JAKARTA (Reuters) - Indonesia will buy 11 Sukh... | worldnews | August 22, 2017 |


In [41]:
fake = pd.read_csv('data/Fake.csv')
fake.tail()

| | title | text | subject | date |
|---|---|---|---|---|
| 23476 | McPain: John McCain Furious That Iran Treated ... | 21st Century Wire says As 21WIRE reported earl... | Middle-east | January 16, 2016 |
| 23477 | JUSTICE? Yahoo Settles E-mail Privacy Class-ac... | 21st Century Wire says It s a familiar theme. ... | Middle-east | January 16, 2016 |
| 23478 | Sunnistan: US and Allied ‘Safe Zone’ Plan to T... | Patrick Henningsen 21st Century WireRemember ... | Middle-east | January 15, 2016 |
| 23479 | How to Blow $700 Million: Al Jazeera America F... | 21st Century Wire says Al Jazeera America will... | Middle-east | January 14, 2016 |
| 23480 | 10 U.S. Navy Sailors Held by Iranian Military ... | 21st Century Wire says As 21WIRE predicted in ... | Middle-east | January 12, 2016 |
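
Since the real and fake articles arrive in two separate files, a classifier would eventually need them in one table with a target column. The sketch below shows one possible way to do that on tiny made-up frames. The `label` column name and its 1/0 encoding are hypothetical and not something the dataset provides.

```python
import pandas as pd

# Tiny made-up stand-ins for the `true` and `fake` DataFrames loaded above.
true_demo = pd.DataFrame({
    "title": ["real headline A", "real headline B"],
    "subject": ["worldnews", "politicsNews"],
})
fake_demo = pd.DataFrame({
    "title": ["fake headline A", "fake headline B", "fake headline C"],
    "subject": ["News", "politics", "Middle-east"],
})

# Hypothetical target column: 1 for real articles, 0 for fake ones.
true_demo["label"] = 1
fake_demo["label"] = 0

# Stack both frames and rebuild a clean 0..n-1 index.
news = pd.concat([true_demo, fake_demo], ignore_index=True)
print(news["label"].value_counts())
```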


## Exploratory data analysis

#### Copy both DataFrames so the originals stay unchanged, then explore the details of each DataFrame

In [32]:
df_true = true.copy()
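
As a small illustration of why `.copy()` is used here instead of plain assignment, the sketch below works on a made-up two-row frame. The names `original`, `working` and `alias` are hypothetical.

```python
import pandas as pd

original = pd.DataFrame({"subject": ["worldnews", "politicsNews"]})

working = original.copy()  # independent copy (deep by default)
alias = original           # just a second name for the same object

# Changing the copy leaves the original untouched.
working["subject"] = working["subject"].str.upper()

print(original["subject"].tolist())
print(working["subject"].tolist())
print(alias is original)
```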

In [33]:
df_true.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21417 entries, 0 to 21416
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   title    21417 non-null  object
 1   text     21417 non-null  object
 2   subject  21417 non-null  object
 3   date     21417 non-null  object
dtypes: object(4)
memory usage: 669.4+ KB


In [34]:
df_true.nunique()

title      20826
text       21192
subject        2
date         716
dtype: int64

In [35]:
df_true['subject'].unique()

array(['politicsNews', 'worldnews'], dtype=object)

In [11]:
df_fake = fake.copy()

In [12]:
df_fake.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23481 entries, 0 to 23480
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   title    23481 non-null  object
 1   text     23481 non-null  object
 2   subject  23481 non-null  object
 3   date     23481 non-null  object
dtypes: object(4)
memory usage: 733.9+ KB


In [13]:
df_fake.nunique()

title      17903
text       17455
subject        6
date        1681
dtype: int64

In [14]:
df_fake['subject'].unique()

array(['News', 'politics', 'Government News', 'left-news', 'US_News',
       'Middle-east'], dtype=object)
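
In the `nunique()` output above, `text` has fewer unique values than there are rows (21,192 of 21,417 for true news and 17,455 of 23,481 for fake news), which suggests that some articles are repeated. Below is a minimal sketch of how such repeats could be counted and dropped, using a small made-up frame. Whether to deduplicate on `text` alone or on all columns is a choice the project would still have to make.

```python
import pandas as pd

# Made-up frame: rows 0 and 1 are identical, row 3 repeats the text under a new title.
demo = pd.DataFrame({
    "title": ["A", "A", "B", "C"],
    "text": ["same body", "same body", "other body", "same body"],
})

# Rows that repeat an earlier row in every column.
n_dup_rows = int(demo.duplicated().sum())

# Rows whose text already appeared earlier, regardless of title.
n_dup_text = int(demo.duplicated(subset="text").sum())

# Keep only the first occurrence of each text.
deduped = demo.drop_duplicates(subset="text", keep="first")

print(n_dup_rows, n_dup_text, len(deduped))
```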

#### Data cleanup on the subject column to make it more readable and consistent

In [36]:
# change the content in subject column for consistency: create dictionary and assign to the column
# (every value returned by .unique() above has a key here; .map() turns unlisted values into NaN)

# For True News
true_subject_clean = {'politicsNews': 'Political News',
                      'worldnews': 'World News'}

df_true['subject'] = df_true['subject'].map(true_subject_clean)

# For Fake News
fake_subject_clean = {'politics': 'Political News',
                      'Government News': 'Government News',
                      'News': 'News',
                      'left-news': 'Left News',
                      'US_News': 'US News',
                      'Middle-east': 'Middle Eastern News'}

df_fake['subject'] = df_fake['subject'].map(fake_subject_clean)
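
One thing to watch with `Series.map` and a dictionary is that any value missing from the dictionary becomes `NaN`. The sketch below shows a quick way to check for that on a made-up Series. The value `"Sports"` and the shortened `mapping` dictionary are illustrative and not taken from the dataset.

```python
import pandas as pd

# "Sports" is a made-up value that has no entry in the mapping.
subjects = pd.Series(["politics", "US_News", "Sports"])
mapping = {"politics": "Political News", "US_News": "US News"}

mapped = subjects.map(mapping)

# Original values that the dictionary did not cover.
unmapped = subjects[mapped.isna()].unique()
print(unmapped)

# One option: fall back to the original value where no mapping exists.
safe = mapped.fillna(subjects)
print(safe.tolist())
```

Running a check like this after the mapping step would confirm that no subject was silently turned into a missing value.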