# Neural net classifier to simple rules

## Project Title

"Inducing human-interpretable rules for automated document classification"

## Project definition

We would like to explore how simple, human-interpretable rules for text classification can be derived from state-of-the-art text classifiers based on deep learning.

To that end, we will first build a corpus of Twitter messages. The goal is to classify users into groups with opposing polarity, e.g. on a political issue. We will derive initial class labels from simple rules, e.g. the use of characteristic hashtags.
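As a minimal sketch of what such a hashtag rule could look like (the two seed hashtags and the function name `label_by_hashtags` are illustrative placeholders, not the project's actual rule set):

```python
import re

# Hypothetical seed hashtags for each side; placeholders, not the project's real list.
PRO_TAGS = {"#maga"}
ANTI_TAGS = {"#resist"}

HASHTAG_RE = re.compile(r"#\w+")

def label_by_hashtags(text):
    """Return 1 (pro), 0 (anti), or None when the rule does not apply."""
    tags = {tag.lower() for tag in HASHTAG_RE.findall(text)}
    pro = bool(tags & PRO_TAGS)
    anti = bool(tags & ANTI_TAGS)
    if pro == anti:  # no seed hashtag at all, or hashtags from both sides
        return None
    return 1 if pro else 0
```

Returning `None` for ambiguous or untagged messages keeps the rule conservative: only messages that match exactly one side receive an initial label.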

Then we will use the labelled documents to train a classifier based on a neural network architecture. Finally, we will derive from the resulting classifier human-interpretable rules that preserve classification accuracy as far as possible.

In [2]:
from itertools import islice
import csv
import pandas as pd

## Load the data

### Get the user ranking (author_id, name, trumpscore)

In [3]:
# 0-based line range of the dump that is read for the user ranking
USER_START, USER_END = 50856403, 53079912
ranking_user = []
with open("data/trump.dump", encoding="utf8") as f:
    for i, line in enumerate(islice(f, USER_START, USER_END), start=USER_START + 1):
        if i % 1000000 == 0:
            print(i)  # progress indicator
        elems = line.split("\t")
        # Columns used: 0 = author_id, 1 = name, 4 = authority (stored below as trumpscore)
        ranking_user.append([elems[0], elems[1], elems[4]])

51000000
52000000
53000000
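The line range passed to `islice` is 0-based with an exclusive stop. A small sketch on an in-memory stand-in for the dump (the `fake_dump` content is made up) shows which lines are selected and how many rows the range above should yield:

```python
from io import StringIO
from itertools import islice

# Stand-in for the dump: ten lines "row0" .. "row9"
fake_dump = StringIO("".join(f"row{n}\n" for n in range(10)))

# islice(f, 3, 6) skips the first 3 lines and yields the lines with 0-based index 3, 4, 5
selected = [line.rstrip("\n") for line in islice(fake_dump, 3, 6)]
print(selected)  # ['row3', 'row4', 'row5']

# Number of lines covered by the user-ranking range used in this notebook
n_rows = 53079912 - 50856403
print(n_rows)  # 2223509
```

Note that `islice` still reads the skipped lines one by one, so each pass over the dump costs time proportional to the stop index.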


In [4]:
len(ranking_user)

2223509

In [5]:
df_user = pd.DataFrame(ranking_user, columns=["author_id", "username", "trumpscore"])

In [6]:
df_user["author_id"] = pd.to_numeric(df_user["author_id"])

In [7]:
df_user["trumpscore"] = pd.to_numeric(df_user["trumpscore"])

In [8]:
df_user.head()

Unnamed: 0,author_id,username,trumpscore
0,1781551,dbernstein,-0.781121
1,178434460,drturpin,-0.888996
2,17862067,EmpressNorton,-0.843733
3,17882773,RealJim,-0.743249
4,17900914,TheJohnNantz,0.660194


In [9]:
df_user.to_csv("data/user_trumpscore.csv")

### Get the type of the documents

In [10]:
documents_type = []
with open("data/trump.dump", encoding="utf8") as f:
    for i, line in enumerate(islice(f, 16336370, 23505983), start=16336371):
        if i % 1000000 == 0:
            print(i)  # progress indicator
        doctype = line.split("\t")[7]
        # Skip rows whose type field is "\N" (presumably the dump's marker for a missing value)
        if doctype != "\\N":
            documents_type.append([doctype])

17000000
18000000
19000000
20000000
21000000
22000000
23000000


In [11]:
settype = set([item for sublist in documents_type for item in sublist])

In [12]:
settype

{'instagram', 'twitter', 'web'}
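The set only shows which types exist. To decide whether the other types are worth keeping, it may help to count them as well. A minimal sketch with `collections.Counter`, using a made-up toy list in place of the real `documents_type`:

```python
from collections import Counter

# Toy stand-in for documents_type (a list of one-element lists, as built above)
documents_type = [["twitter"], ["web"], ["twitter"], ["instagram"], ["twitter"]]

# Unpack each one-element list and count occurrences per type
type_counts = Counter(doctype for (doctype,) in documents_type)
print(type_counts.most_common())
```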

### Should I keep only Twitter data, or also the other document types?

#### For the moment I only keep tweets

In [7]:
documents = []
with open("data/trump.dump", encoding="utf8") as f:
    for i, line in enumerate(islice(f, 16336370, 23505983), start=16336371):
        if i % 1000000 == 0:
            print(i)  # progress indicator
        elems = line.split("\t")
        author_id, username, body, doctype = elems[1], elems[2], elems[5], elems[7]
        # Keep tweets that have at least one author identifier that is not "\N"
        if doctype == "twitter" and (username != "\\N" or author_id != "\\N"):
            documents.append([author_id, username, body])

17000000
18000000
19000000
20000000
21000000
22000000
23000000


In [193]:
len(documents)

6672171

In [310]:
df_docs = pd.DataFrame(documents, columns=["author_id", "username", "body"])

In [311]:
df_docs["author_id"] = pd.to_numeric(df_docs["author_id"])

In [312]:
df_docs.head()

Unnamed: 0,author_id,username,body
0,2286887581,\N,"@WhiteHouse @POTUS Pass the vomit bag, please."
1,15725935,\N,@realDonaldTrump senate dems are being deliberate
2,15768846,\N,Thank you @MarkWarner https://t.co/s6UPKZThQA
3,706566014369185793,\N,@WhiteHouse @POTUS I always have Faith
4,21218159,\N,"Well played, @mcsweeneys https://t.co/72XDDBM6k1"


In [313]:
df_docs.to_csv("data/docs_tweets.csv")

## Join dataframe to obtain score on tweets

In [13]:
df_user = pd.read_csv("data/user_trumpscore.csv", index_col=0)
df_user.head()



Unnamed: 0,author_id,username,trumpscore
0,1781551,dbernstein,-0.781121
1,178434460,drturpin,-0.888996
2,17862067,EmpressNorton,-0.843733
3,17882773,RealJim,-0.743249
4,17900914,TheJohnNantz,0.660194


In [14]:
len(df_user)

2223509

In [17]:
len(df_user[df_user['trumpscore']>0.1])

47271

In [18]:
len(df_user[df_user['trumpscore']<-0.1])

45493

In [320]:
df_docs = pd.read_csv("data/docs_tweets.csv", index_col=0, encoding = "ISO-8859-1")
df_docs.head()



Unnamed: 0,author_id,username,body
0,2286887581,\N,"@WhiteHouse @POTUS Pass the vomit bag, please."
1,15725935,\N,@realDonaldTrump senate dems are being deliberate
2,15768846,\N,Thank you @MarkWarner https://t.co/s6UPKZThQA
3,706566014369185793,\N,@WhiteHouse @POTUS I always have Faith
4,21218159,\N,"Well played, @mcsweeneys https://t.co/72XDDBM6k1"
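One thing to watch here: if the CSV was written with pandas' default encoding on Python 3 (UTF-8), reading it back as ISO-8859-1 does not fail, but it silently garbles non-ASCII characters. The notebook does not say why ISO-8859-1 was chosen, so the following is only an illustration of the effect on a made-up string:

```python
text = "café"  # any non-ASCII text, e.g. accented letters in a tweet

utf8_bytes = text.encode("utf-8")            # bytes as written by a UTF-8 CSV
misread = utf8_bytes.decode("iso-8859-1")    # result of reading them as ISO-8859-1
roundtrip = utf8_bytes.decode("utf-8")       # result of reading them as UTF-8

print(misread)    # cafÃ©
print(roundtrip)  # café
```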


In [321]:
len(df_docs)

6672171

In [322]:
df_docs.drop('username', axis=1, inplace=True)

In [323]:
df_docs_score = df_docs.set_index('author_id').join(df_user.set_index('author_id'))
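`DataFrame.join` performs a left join by default, so every tweet is kept and tweets whose author has no entry in the ranking get `NaN` in the user columns. A small sketch with made-up toy frames (`docs` and `users` stand in for `df_docs` and `df_user`):

```python
import pandas as pd

# Toy stand-ins for df_docs and df_user; the values are made up
docs = pd.DataFrame({"author_id": [12, 12, 99], "body": ["a", "b", "c"]})
users = pd.DataFrame({"author_id": [12], "username": ["jack"], "trumpscore": [-0.6]})

# Left join on the shared author_id index
scored = docs.set_index("author_id").join(users.set_index("author_id"))

print(len(scored))                        # 3
print(scored["trumpscore"].isna().sum())  # 1
```

This is why the rows with a missing score have to be filtered out explicitly after the join.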

In [328]:
df_docs_score.head()

author_id,body,username,trumpscore
12,We're all directly accelerating climate change...,jack,-0.617434
12,This country was founded on one self-evident a...,jack,-0.617434
12,I've been CEO of both companies for over 3 mon...,jack,-0.617434
12,"My focus is to build teams that move fast, and...",jack,-0.617434
12,"Good to know!: ""we also prepare and serve cric...",jack,-0.617434


In [353]:
df_docs_score.reset_index(inplace=True)

In [332]:
len(df_docs)

6672171

In [326]:
# Remove NaN scores
df_docs_score = df_docs_score[pd.notnull(df_docs_score['trumpscore'])]

In [331]:
len(df_docs_score)

5519028

In [333]:
# Remove 0.0 scores
df_docs_score = df_docs_score[df_docs_score['trumpscore']!=0.0]

In [355]:
len(df_docs_score)

3472172

In [356]:
df_docs_score.head(25)

Unnamed: 0,author_id,body,username,trumpscore
0,12,We're all directly accelerating climate change...,jack,-0.617434
1,12,This country was founded on one self-evident a...,jack,-0.617434
2,12,I've been CEO of both companies for over 3 mon...,jack,-0.617434
3,12,"My focus is to build teams that move fast, and...",jack,-0.617434
4,12,"Good to know!: ""we also prepare and serve cric...",jack,-0.617434
5,12,@mcnees @Support not sure how this got past us...,jack,-0.617434
6,12,Excited to welcome @candi to lead our inclusio...,jack,-0.617434
7,12,2. Some people who unfollowed @POTUS in the pa...,jack,-0.617434
8,12,Grateful for 11 years of people using Twitter ...,jack,-0.617434
9,12,Square is the most powerful set of tools to st...,jack,-0.617434


### Should we keep all users, independent of their trumpscore, or only users with a strong (negative or positive) trumpscore?

#### For the moment I take them all
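If the stricter option is chosen later, the filter could look like the sketch below. It assumes the same 0.1 cut-off that was used for the user counts earlier in this notebook; the toy frame and the name `THRESHOLD` are made up for illustration.

```python
import pandas as pd

THRESHOLD = 0.1  # assumed cut-off, same value as in the user counts above

# Toy stand-in for df_docs_score; the values are made up
df = pd.DataFrame({"author_id": [1, 2, 3, 4],
                   "trumpscore": [-0.62, 0.05, 0.66, -0.08]})

# Keep only rows whose score is clearly negative or clearly positive
strong = df[df["trumpscore"].abs() > THRESHOLD]
print(strong["author_id"].tolist())  # [1, 3]
```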

In [360]:
df_docs_labelled = pd.DataFrame(df_docs_score)

In [361]:
# For-Trump label = 1, Anti-Trump label = 0
def labelize(row):
    if row["trumpscore"]>0:
        return 1
    else:
        return 0

In [362]:
df_docs_labelled['label'] = df_docs_score.apply(lambda row: labelize(row), axis=1)
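A row-wise `apply` calls the Python function once per row, which can be slow on several million tweets. A vectorized comparison should give the same labels; the sketch below uses a made-up toy frame rather than the real data:

```python
import pandas as pd

# Toy stand-in for df_docs_score; the values are made up
df = pd.DataFrame({"trumpscore": [-0.617434, 0.660194, -0.1, 0.3]})

# Compare the whole column at once, then cast True/False to 1/0
df["label"] = (df["trumpscore"] > 0).astype(int)
print(df["label"].tolist())  # [0, 1, 0, 1]
```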

In [364]:
df_docs_labelled.head()

Unnamed: 0,author_id,body,username,trumpscore,label
0,12,We're all directly accelerating climate change...,jack,-0.617434,0
1,12,This country was founded on one self-evident a...,jack,-0.617434,0
2,12,I've been CEO of both companies for over 3 mon...,jack,-0.617434,0
3,12,"My focus is to build teams that move fast, and...",jack,-0.617434,0
4,12,"Good to know!: ""we also prepare and serve cric...",jack,-0.617434,0


In [365]:
# Save the frame that carries the label column
df_docs_labelled.to_csv("data/tweets_labelled.csv")