# Natural Language Processing
## Fake news identification

by Daniel Russotto and Christine Utendorf

In this assignment our goal is to determine whether a provided article, consisting of a title and a text, reports real or fake news. Fake news consists of disinformation and poses a great threat to today's society. Not knowing what to believe when it comes to news and information or, even worse, not recognizing that the information provided does not reflect the truth can truly harm a reader. Especially with the rise of the internet and, with it, the rise of social media, news can be accessed at any time, in any place and from many different sources. However, this also allows fake news to spread faster, wider and more successfully.

Social media giant Facebook has set up a unit to identify such fake news on its platform. After being criticized more than once for doing too little against the spread of false information, Facebook is now "working to stop misinformation and false news". The company is not only working together with third-party fact-checking organizations but is also applying machine learning techniques to identify posts that contain fake news (see more under the [link](https://www.facebook.com/facebookmedia/blog/working-to-stop-misinformation-and-false-news)). It is highly likely that Facebook uses natural language processing and classification algorithms to determine whether a post is fraudulent or not.

In this assignment we (Dan and Christine) are going to work on a problem that Facebook (as well as Twitter, YouTube and many other platforms) faces every day: identifying whether an article provides real or fake news. We are not going to use fact-checking to verify whether a piece of information is accurate, but train a machine learning algorithm to classify articles as fake or real. To do so we use several concepts of natural language processing, such as tokenization and lemmatization from the NLTK Python library, as well as machine learning concepts including logistic regression and naive Bayes from the sklearn Python library. For this purpose we were provided with a training data set that includes articles that are already labeled as real or fake and a test set without such labels. The goal is to train a model that is able to find a general pattern to identify fake news among articles it has never seen before (here our "blind" test data set).

### Library loading

In [1]:
import pandas as pd 
import numpy as np
import nltk
nltk.download('wordnet')
nltk.download('punkt')
import sklearn

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Dan\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Dan\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## 1. Data sets

In [2]:
train_data = pd.read_csv("data/fake_or_real_news_training.csv", quotechar='"', header=0, sep=",",
                    index_col="ID", encoding='utf-8')
test_data = pd.read_csv("data/fake_or_real_news_test.csv", quotechar='"', header=0, sep=",",
                   index_col="ID", encoding='utf-8')

#### Train data set
The training data set has 3,999 rows, each representing an article. All articles are identified by a unique ID and have a title, a text and a label stating whether they are fake or real. Furthermore there are the columns X1 and X2. These two columns should be filled entirely with NaN (= empty). However, 33 rows have values in X1, and 2 of these 33 also in X2. This shows that the text was not split properly in these cases. The separator used to split the CSV into a data frame is ",", but in some cases this did not split the rows correctly. In the data cleaning part we will take a closer look at this.

In [3]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3999 entries, 8476 to 9673
Data columns (total 5 columns):
title    3999 non-null object
text     3999 non-null object
label    3999 non-null object
X1       33 non-null object
X2       2 non-null object
dtypes: object(5)
memory usage: 187.5+ KB
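
To illustrate how such a displacement can arise, here is a minimal sketch on a made-up two-row file. It assumes that the cause is a comma in a title that is not protected by quotes. The file content and the name `toy` are hypothetical and not taken from the actual data.

```python
import io

import pandas as pd

# Hypothetical file with the same header as the training data.
# Row 1 has a quoted title; row 2 has an unquoted comma in its title.
toy_csv = io.StringIO(
    "ID,title,text,label,X1,X2\n"
    '1,"A quoted, safe title",Some text,REAL,,\n'
    "2,An unquoted, broken title,Some text,FAKE,\n"
)

toy = pd.read_csv(toy_csv, quotechar='"', header=0, sep=",", index_col="ID")

# In row 2 every field after the comma has moved one column to the right
print(toy.loc[2])
```

In the second row the title is cut at the comma, the article text lands in the label column and the label lands in X1, which is the same symptom as the non-null X1 values reported above.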


In [4]:
train_data.head(5)

ID,title,text,label,X1,X2
8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE,,
10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE,,
3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL,,
10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE,,
875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL,,


#### Test data
The test data set consists of 2,321 unlabeled rows. Since the data frame only includes the unique ID, the title and the actual text, it seems that the text split worked well here (no X1 or X2). The training data contains less than twice as many articles as the test data set. Because data is limited, it is crucial to retrain the final machine learning model on the complete training data set at the end.

In [5]:
test_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2321 entries, 10498 to 4330
Data columns (total 2 columns):
title    2321 non-null object
text     2321 non-null object
dtypes: object(2)
memory usage: 54.4+ KB


In [6]:
test_data.head(5)

ID,title,text
10498,September New Homes Sales Rise——-Back To 1992 ...,September New Homes Sales Rise Back To 1992 Le...
2439,Why The Obamacare Doomsday Cult Can't Admit It...,But when Congress debated and passed the Patie...
864,"Sanders, Cruz resist pressure after NY losses,...",The Bernie Sanders and Ted Cruz campaigns vowe...
4128,Surviving escaped prisoner likely fatigued and...,Police searching for the second of two escaped...
662,Clinton and Sanders neck and neck in Californi...,No matter who wins California's 475 delegates ...


### 1.1 Data cleaning
As seen in the output of train_data.info(), several rows were not placed correctly in the data frame. We are now going to fix these rows. The CSV is split on commas, so a title or text that contains a comma can be split incorrectly. A first step is to filter out the displaced rows and take a closer look at them. Overall, we have 33 rows with displaced values, since all rows that have values in X2 also have values in X1. It is important to fix these rows because the number of training rows is limited.

In [7]:
displaced_rows = train_data[train_data.X1.notnull()]
displaced_rows

ID,title,text,label,X1,X2
599,Election Day: No Legal Pot In Ohio,Democrats Lose In The South,Election Day: No Legal Pot In Ohio; Democrats ...,REAL,
10194,Who rode it best? Jesse Jackson mounts up to f...,Leonardo DiCaprio to the rescue?,Who rode it best? Jesse Jackson mounts up to f...,FAKE,
356,Black Hawk crashes off Florida,human remains found,(CNN) Thick fog forced authorities to suspend ...,REAL,
2786,Afghanistan: 19 die in air attacks on hospital,U.S. investigating,(CNN) Aerial bombardments blew apart a Doctors...,REAL,
3622,Al Qaeda rep says group directed Paris magazin...,US issues travel warning,A member of Al Qaeda's branch in Yemen said Fr...,REAL,
7375,Shallow 5.4 magnitude earthquake rattles centr...,shakes buildings in Rome,00 UTC © USGS Map of the earthquake's epicent...,FAKE,
9097,ICE Agent Commits Suicide in NYC,Leaves Note Revealing Gov’t Plans to Round-up...,Email Print After writing a lengthy suicide no...,FAKE,
9203,Political Correctness for Yuengling Brewery,What About Our Opioid Epidemic?,We Are Change \r\n\r\nIn today’s political cli...,FAKE,
1602,Poll gives Biden edge over Clinton against GOP...,VP meets with Trumka,A new national poll shows Vice President Biden...,REAL,
4562,Russia begins airstrikes in Syria,U.S. warns of new concerns in conflict,Russian warplanes began airstrikes in Syria on...,REAL,


Overall, it seems that the problem lies within the title. In the 31 cases that have only one misplacement (X2 = NaN), the title was split in two, so that the actual label ended up in the X1 column and the article text in the label column. For the doubly split row with ID 9, the title apparently contained two commas, leading to a double split. The row with ID 6268, however, only repeats the wrongly split phrase in the label and X1 columns. There is no sign of further text, and thus this row should be excluded since it does not provide an actual article text.
However, let's start with the rows that have one wrong column break by joining the title and the text field back together into the full title and then replacing the text and label columns with the actual values.

In [8]:
# Rows with a value in X1 are displaced: shift their columns one position to the left
mask = train_data.X1.notnull()
train_data.loc[mask, 'title'] = train_data.loc[mask, 'title'] + train_data.loc[mask, 'text']
train_data.loc[mask, 'text'] = train_data.loc[mask, 'label']
train_data.loc[mask, 'label'] = train_data.loc[mask, 'X1']
train_data.loc[mask, 'X1'] = train_data.loc[mask, 'X2']
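
The same shift can be tried on a small made-up frame. This sketch differs from the cell above in two assumed details: it puts the comma back between the two title halves, and it moves the remaining columns in a loop over (target, source) pairs. The frame `toy` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 1 is intact, row 2 is displaced by one column
toy = pd.DataFrame(
    {
        "title": ["Intact title", "Broken title"],
        "text": ["Intact text", " second half"],
        "label": ["REAL", "Broken text"],
        "X1": [np.nan, "FAKE"],
        "X2": [np.nan, np.nan],
    },
    index=pd.Index([1, 2], name="ID"),
)

# Displaced rows are the ones with a value in X1
mask = toy["X1"].notnull()

# Rebuild the title, restoring the comma that caused the split
toy.loc[mask, "title"] = toy.loc[mask, "title"] + "," + toy.loc[mask, "text"]

# Shift the remaining columns one position to the left, in this order
for target, source in [("text", "label"), ("label", "X1"), ("X1", "X2")]:
    toy.loc[mask, target] = toy.loc[mask, source]

print(toy)
```

The order of the pairs matters: each column is overwritten only after its old value has been copied to the column on its left.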

After fixing the column break for the first 31 rows, we now look at the last two displaced rows. While the row with ID 9 seems easy to fix, the row with ID 6268 does not seem to contain any text. Thus we fix row 9 and exclude row 6268.

In [9]:
displaced_rows = train_data[train_data.X1.notnull()]
displaced_rows

ID,title,text,label,X1,X2
9,Planned Parenthood’s lobbying effort pay raise...,and the future Fed rates,PLANNED PARENTHOOD’S LOBBYING GETS AGGRESSIVE....,REAL,REAL
6268,Chart Of The Day: Since 2009—–Recovery For The...,Chart Of The Day: Since 2009 Recovery For The 5%,Stagnation for the 95%,FAKE,FAKE


#### Fix row 9

In [10]:
train_data.loc[9, 'title'] = train_data.loc[9, 'title'] + "," + train_data.loc[9, 'text']
train_data.loc[9, 'text'] = train_data.loc[9, 'label']
train_data.loc[9, 'label'] = train_data.loc[9, 'X1']
train_data.loc[9, 'X1'] = np.nan
train_data.loc[9, 'X2'] = np.nan

#### Exclude row 6268

In [11]:
# Drop row 6268: it is the only row that still has a value in X2 and it contains no article text
train_data = train_data[train_data.X2.isnull()].copy()

# Strip whitespace and mark articles whose text is empty with the placeholder "zzzzz"
train_data["text"] = train_data["text"].str.strip().replace("", "zzzzz")

# Show the articles that have no text
train_data[train_data["text"] == "zzzzz"]

ID,title,text,label,X1,X2
5530,The Arcturian Group by Marilyn Raffaele Octobe...,zzzzz,FAKE,,
8332,MARKETWATCH LEFTIST: MSM’s “Blatant” Anti Trum...,zzzzz,FAKE,,
9314,Southern Poverty Law Center Targets Anti-Jihad...,zzzzz,FAKE,,
10304,Refugee Resettlement Watch: Swept Away In Nort...,zzzzz,FAKE,,
9474,Michael Bloomberg Names Technological Unemploy...,zzzzz,FAKE,,
5802,Alert News : Putins Army Is Coming For World W...,zzzzz,FAKE,,
9564,An LDS Reader Takes A Look At Trump Accuser Je...,zzzzz,FAKE,,
5752,America’s Senator Jeff Sessions Warns of Worse...,zzzzz,FAKE,,
8816,Paris Migrant Campers Increase after Calais Is...,zzzzz,FAKE,,
7525,Putins Army is coming for World war 3 against ...,zzzzz,FAKE,,


In [12]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3998 entries, 8476 to 9673
Data columns (total 5 columns):
title    3998 non-null object
text     3998 non-null object
label    3998 non-null object
X1       0 non-null object
X2       0 non-null object
dtypes: object(5)
memory usage: 187.4+ KB


After fixing the displaced rows we can now delete the columns X1 and X2.

In [13]:
train_data = train_data[['title', 'text','label']]
train_data.head()

ID,title,text,label
8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL


### 1.2 Data exploration

#### Target distribution
Since we are working on a classification problem, it is important to look at the target distribution. Highly imbalanced targets usually require resampling or similar methods in order to train a well-performing machine learning model. Thus our first step in data exploration is to check the number of fake and real labels in our data set:

In [14]:
train_data[train_data.label == "REAL"].count()

title    2008
text     2008
label    2008
dtype: int64

In [15]:
train_data[train_data.label == "FAKE"].count()

title    1990
text     1990
label    1990
dtype: int64

The target is almost equally distributed with 2008 real and 1990 fake articles.
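
A more compact way to obtain the same numbers is `value_counts`. The sketch below rebuilds a label column from the counts reported above instead of loading the data, so the series `labels` is only a stand-in for `train_data.label`.

```python
import pandas as pd

# Stand-in for train_data.label, rebuilt from the counts reported above
labels = pd.Series(["REAL"] * 2008 + ["FAKE"] * 1990, name="label")

counts = labels.value_counts()                 # absolute number per class
shares = labels.value_counts(normalize=True)   # relative share per class

print(counts)
print(shares.round(3))
```

With these counts the real articles make up about 50.2% and the fake articles about 49.8% of the training data.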

#### Real news vs. fake news

<font color=red>

Set language to English

Pos-tagging text and title
    
- Common words text
- Common words title

- Common words text (w/o stopwords, lowercase)
- Common words title (w/o stopwords, lowercase)

- Distribution of word classes overall (Adj, noun, ...)

- length of title
- length of text

- lexical diversity

</font>
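
For the lexical diversity item in the list above, one common definition is the type-token ratio, i.e. the number of distinct words divided by the total number of words. The sketch below assumes this definition and a simple regular-expression tokenizer instead of NLTK. The function name `lexical_diversity` is hypothetical.

```python
import re


def lexical_diversity(text):
    """Share of distinct words among all words (type-token ratio)."""
    # Assumption: lowercase the text and keep runs of letters as tokens
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0  # avoid a division by zero for empty texts
    return len(set(tokens)) / len(tokens)


print(lexical_diversity("The cat saw the dog"))
```

A value close to 1 means that almost every word is used only once, while a low value points to many repetitions. Because longer texts tend to repeat more words, the measure is best compared between texts of similar length.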

## 2. Text preparation
<font color=red>


1. Lemmatizing
2. Stemming

again lexical diversity

Creating different data sets for modelling

- Data option 1 for modelling: No stopwords, no punctuation
- Data option 2 for modelling: Bi-grams
- Data option 3 for modelling: Tri-grams

(for 2 & 3 stopwords & punctuation are needed)


</font>
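
As a small illustration of data options 2 and 3, the sketch below builds bi-grams and tri-grams from a token list in plain Python. The helper `ngrams` is hypothetical and only meant to show what an n-gram is; the vectorizers used later create them internally.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "kerry to go to paris".split()

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```

The example also hints at why stopwords are kept for these options: removing "to" would turn "kerry go" and "go paris" into neighbours although they never appear next to each other in the title.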

Clean_title: lowercased, with stopwords and with punctuation. Lemmatized and then stemmed

Clean_text: lowercased, with stopwords and with punctuation. Lemmatized and then stemmed 

Clean_combined: lowercased, with stopwords and with punctuation. Lemmatized and then stemmed 

To compute lexical diversity, use clean_combined and then remove stopwords and punctuation

Use all of them (but not at the same time) for vectorization (unigrams, bigrams, TF-IDF, etc.)

In [27]:
from nltk.stem import PorterStemmer, WordNetLemmatizer
from functools import reduce
from nltk import word_tokenize, WordPunctTokenizer
from nltk.corpus import wordnet, stopwords
import re

# is this really a good idea?
def get_wordnet_pos(treebank_tag):

    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN #by default is noun

def postag_text(txt, rm_stopwords=True, lowertext=True, rm_punct=True):
    if rm_punct:
        txt = re.sub("[^a-zA-Z]", " ", txt)
        
    tokenized = word_tokenize(txt)
    if lowertext:
        tokenized = list(map(lambda word: word.lower(), tokenized))
        
    if rm_stopwords:
        sw = set(stopwords.words('english'))
        tokenized = list(filter(lambda word: word not in sw, tokenized))
        
    tags = nltk.pos_tag(tokenized)
    return list(map(lambda tag_info: (tag_info[0], get_wordnet_pos(tag_info[1])), tags))
    


def lemmatize_text(text, lowertext=True, rm_punct=False):
    wnlt = WordNetLemmatizer()
    
    # Tag first so that each token is lemmatized with its part of speech; stopwords are kept
    tagged = postag_text(text, rm_stopwords=False, lowertext=lowertext, rm_punct=rm_punct)
    
    lemmatized_list = [wnlt.lemmatize(word, pos=pos) for word, pos in tagged]
    
    # " ".join also works for an empty token list, where reduce() without a start value raises a TypeError
    return " ".join(lemmatized_list)
    
print(lemmatize_text("Dogs run, he runs, she is drinking, people are sitting"))

def stem_text(text, lowertext=True, rm_punct=False):
    pst = PorterStemmer()
    
    if rm_punct:
        text = re.sub("[^a-zA-Z]", " ", text)
    
    tokenized = word_tokenize(text)
    
    if lowertext:
        tokenized = [word.lower() for word in tokenized]
    
    # Join the stemmed tokens back into one string (also works for an empty token list)
    return " ".join(pst.stem(word) for word in tokenized)

def stem_lemmatize_text(text, lowertext=True, rm_punct=False):
    lemmatized = lemmatize_text(text, lowertext=lowertext, rm_punct=rm_punct)
    return stem_text(lemmatized, lowertext=lowertext, rm_punct=rm_punct)


print(stem_text("Dogs run, he runs, she is drinking, people are sitting"))
print(stem_lemmatize_text("Dogs run, he runs, she is drinking, people are sitting"))


dog run , he run , she be drink , people be sit
dog run , he run , she is drink , peopl are sit
dog run , he run , she be drink , peopl be sit


In [37]:
def clean_col(df, col):
    return df.apply(lambda row: stem_lemmatize_text(row[col]), axis=1).to_frame(name="clean_{}".format(col))
    
def combine_cols(col1, col2, name):
    new_series = col1 + "." + col2
    return new_series.to_frame(name=name)


#pd.Series(["a", "b", "c"]) + pd.Series(["1", "2", "3"])

#combine_cols(pd.Series(["a", "b", "c"]),pd.Series(["1", "2", "3"]), "boop")



In [29]:
import datetime

print(datetime.datetime.now())

clean_text = clean_col(train_data, "text")
clean_title = clean_col(train_data, "title")
# clean_col returns one-column data frames, so pass the columns themselves (Series) to combine_cols
clean_combined = combine_cols(clean_text["clean_text"], clean_title["clean_title"], "clean_combined")

train_data = pd.concat([train_data,
                        clean_text,
                        clean_title,
                        clean_combined
                       ], axis=1)

print(datetime.datetime.now())

train_data.head()

2019-05-29 15:43:43.983664

POS frequency distribution:

Two sets of columns, POS_title and POS_text, generated by lowercasing, removing punctuation, and removing stopwords.

Each set holds the share of four word classes: ADJ, VERB, NOUN, ADVERB
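
To make the idea of a POS frequency distribution concrete, here is a minimal sketch that turns tagged tokens into shares per word class. The tags are assigned by hand for illustration and are not real tagger output. The helper `tag_shares` is hypothetical and uses `collections.Counter` instead of a manual counting loop.

```python
from collections import Counter

# Hand-assigned (token, WordNet tag) pairs; 'a' = adjective, 'n' = noun, 'v' = verb
tagged_title = [("new", "a"), ("york", "n"), ("primary", "n"), ("matters", "v")]


def tag_shares(pos_tags):
    """Share of each tag among all tagged tokens."""
    counts = Counter(tag for _, tag in pos_tags)
    total = sum(counts.values())
    return {tag: count / total for tag, count in counts.items()}


shares = tag_shares(tagged_title)
print(shares)
```

A word class that does not occur at all is simply missing from the result, which is why missing values have to be filled with 0 once the shares are collected in a data frame.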

In [45]:
from nltk.corpus import stopwords
from functools import reduce
from nltk import word_tokenize, WordPunctTokenizer
from nltk.corpus import wordnet


def tag_counts(pos_tags):
    counts = {}
    for word, tag in pos_tags:
        if tag in counts:
            counts[tag] += 1
        else:
            counts[tag] = 1
    return counts

def tag_dists(counts):
    total_words = sum(counts.values())
    dists = {}
    for tag, count in counts.items():
        dists[tag] = count / total_words
    
    return dists

def create_dists_series(text):
    return tag_dists(tag_counts(postag_text(text)))

def create_dists_df(df, col):
    return pd.DataFrame(list(df.apply(lambda row: create_dists_series(row[col]), axis=1)))
    #     as_series = df.apply(lambda row: create_dists_row(row["text"]), axis=1)

    #     pd.Dataframe({"n": as_series["n"]})
    
def create_title_dists(df):
    df = create_dists_df(df, "title")

    df = df.rename(columns={'a': 'title_adj_dist',
                            'n': 'title_noun_dist',
                            'v': 'title_verb_dist',
                            'r': 'title_adverb_dist'})
    # Titles are short, certain word types may not appear
    df = df.fillna(0)
    
    return df

def create_text_dists(df):
    df = create_dists_df(df, "text")

    df = df.rename(columns={'a': 'text_adj_dist',
                            'n': 'text_noun_dist',
                            'v': 'text_verb_dist',
                            'r': 'text_adverb_dist'})
    # Very short texts (e.g. the "zzzzz" placeholder) may lack some word types
    df = df.fillna(0)
    
    return df
    



Unnamed: 0,ID,title,text,label,title_adj_dist,title_noun_dist,title_adverb_dist,title_verb_dist,text_adj_dist,text_noun_dist,text_adverb_dist,text_verb_dist
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE,0.333333,0.666667,0.000000,0.000000,0.210448,0.505970,0.073134,0.210448
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE,0.181818,0.727273,0.000000,0.090909,0.150000,0.545833,0.083333,0.220833
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL,0.200000,0.600000,0.000000,0.200000,0.253112,0.514523,0.024896,0.207469
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE,0.125000,0.625000,0.000000,0.250000,0.226337,0.497942,0.041152,0.234568
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL,0.400000,0.600000,0.000000,0.000000,0.196629,0.528090,0.028090,0.247191
5,6903,"Tehran, USA","\r\nI’m not an immigrant, but my grandparent...",FAKE,0.000000,1.000000,0.000000,0.000000,0.198268,0.485714,0.071861,0.244156
6,7341,Girl Horrified At What She Watches Boyfriend D...,"Share This Baylee Luciani (left), Screenshot o...",FAKE,0.000000,0.500000,0.000000,0.500000,0.182482,0.474453,0.062044,0.281022
7,95,‘Britain’s Schindler’ Dies at 106,A Czech stockbroker who saved more than 650 Je...,REAL,0.000000,1.000000,0.000000,0.000000,0.185714,0.514286,0.042857,0.257143
8,4869,Fact check: Trump and Clinton at the 'commande...,Hillary Clinton and Donald Trump made some ina...,REAL,0.000000,0.857143,0.000000,0.142857,0.163484,0.558442,0.045837,0.232238
9,2909,Iran reportedly makes new push for uranium con...,Iranian negotiators reportedly have made a las...,REAL,0.333333,0.444444,0.111111,0.111111,0.242925,0.481132,0.044811,0.231132


In [None]:
title_dists_df = create_title_dists(train_data)
text_dists_df = create_text_dists(train_data)

with_pos_tags = pd.concat([train_data.reset_index(), title_dists_df, text_dists_df], axis=1)
with_pos_tags

Vectorization removes punctuation and lowercases by default.

For unigrams I pass a list of stopwords to remove. For bigrams and trigrams I keep them.
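
The default behaviour can be checked on a single short title. The sketch below fits two `CountVectorizer` instances with default settings, one for unigrams and one for bigrams, and reads the learned vocabulary. The variable names are chosen for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Tehran, USA"]

# Default settings: lowercase the text and split on punctuation and whitespace
unigram_vec = CountVectorizer().fit(docs)
unigrams = sorted(unigram_vec.vocabulary_)

# Same defaults, but only pairs of neighbouring tokens are kept
bigram_vec = CountVectorizer(ngram_range=(2, 2)).fit(docs)
bigrams = sorted(bigram_vec.vocabulary_)

print(unigrams)
print(bigrams)
```

The comma disappears and both words are lowercased, so no separate cleaning step is needed for punctuation or case before the text is passed to the vectorizer.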

In [59]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# get_feature_names_out() needs scikit-learn >= 1.0 (older versions: get_feature_names()).
# ngram_range is passed as a tuple, as the scikit-learn documentation specifies.

def unigram_counts(df, col="text"):
    vec = CountVectorizer(stop_words=stopwords.words("english"))
    X = vec.fit_transform(list(df[col]))
    
    return pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())

def bigram_counts(df, col="text"):
    vec = CountVectorizer(ngram_range=(2, 2))
    X = vec.fit_transform(list(df[col]))
    
    return pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())

def trigram_counts(df, col="text"):
    vec = CountVectorizer(ngram_range=(3, 3))
    X = vec.fit_transform(list(df[col]))
    
    return pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())

def tf_idf(df, col="text"):
    vec = TfidfVectorizer(stop_words=stopwords.words("english"))
    X = vec.fit_transform(list(df[col]))
    
    return pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())
    

bigram_counts(train_data.iloc[5:6], col="title")

#list(train_data.iloc[:1]["title"]
#stop_words=stopwords.words("english"), ngram_range=(1, 3)

#stopwords.words("english")

Unnamed: 0,tehran usa
0,1


Unnamed: 0,106,anger,anger dnc,anger dnc tried,battle,battle new,battle new york,bernie,bernie supporters,bernie supporters twitter,...,warn,watch,watch exact,watch exact moment,watches,watches boyfriend,watches boyfriend left,york,york primary,york primary matters
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,1,1,1,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,1,1,1,0,0,0,1,1,1,...,1,0,0,0,0,0,0,0,0,0
4,0,0,0,0,1,1,1,0,0,0,...,0,0,0,0,0,0,0,1,1,1
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,1,1,0,0,0
7,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## 3. Classification modelling
In order to evaluate the performance of our models we will use 5-fold cross-validation.

<font color=red>

- Naive Bayes
- Max Entropy Classifier (Logistic Regression)
- SVM Classifier

Further steps
- Include bi-grams, tri-grams
- Include TF-IDF

</font>
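
As a rough sketch of how the planned evaluation could look, the example below runs 5-fold cross-validation for a naive Bayes classifier on word counts. It is not our final model: the ten short texts and their labels are made up for illustration, and the pipeline (`CountVectorizer` followed by `MultinomialNB`) is only one of the options listed above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy data: five "real" and five "fake" style headlines
texts = [
    "senate passes budget bill after long debate",
    "president meets foreign minister in paris",
    "court rules on voting rights case",
    "governor signs new education law",
    "officials report steady job growth",
    "shocking secret they do not want you to know",
    "you will not believe this miracle cure",
    "secret plot exposed by anonymous insider",
    "shocking truth about the election hidden from you",
    "miracle trick doctors do not want you to know",
]
labels = ["REAL"] * 5 + ["FAKE"] * 5

# Putting the vectorizer inside the pipeline refits the vocabulary on the
# training part of every fold, so the held-out fold does not leak into it
model = make_pipeline(CountVectorizer(), MultinomialNB())

# For classifiers, an integer cv uses stratified folds; the score is accuracy
scores = cross_val_score(model, texts, labels, cv=5)

print(scores)
print(scores.mean())
```

Swapping `MultinomialNB` for another sklearn classifier, such as logistic regression or a linear SVM, only changes the last step of the pipeline, which keeps the comparison between models fair.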

## 4. Final model

## 5. Conclusion