# Machine Learning Intro


## Traditional rule based systems / Software 1.0
The process is straightforward: a software engineer writes the rules the system uses to make decisions, then runs the program.

In [1]:
#decision rules
def make_decision(a, b):
    if (a < 15):
        if (b < 20):
            return True
        else:
            return False
    else:
        if (b < 20):
            return False
        else:
            return True 
        
#print output based on rules
make_decision(10, 34)

False
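
The nested branches above return `True` exactly when the two threshold tests agree, so the same rule can be written as a single comparison. A minimal sketch (the name `make_decision_compact` is introduced here for illustration and is not part of the original notebook):

```python
def make_decision_compact(a, b):
    # True when both thresholds are met or both are missed,
    # which is what the nested if/else version encodes.
    return (a < 15) == (b < 20)

# a < 15 holds but b < 20 does not, so the tests disagree
print(make_decision_compact(10, 34))
```

Either form works; the point is that in Software 1.0 a person has to know the thresholds and write them down.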

## Machine learning systems / Software 2.0
A software engineer writes a program that works out the decision rules from past examples of decisions. The process is more involved:
1. Collect past examples of decisions (training data)
2. Clean the data and transform it into a format your algorithm accepts (binary classes, vectors, numbers)
3. Train your algorithm, test it, and fine-tune it
4. Run predictions with your model

In [3]:
#import machine learning algorithm
from sklearn import tree

# Steps 1 and 2. Training data (past decision examples)
train_x = [[10, 34], [9, 4], [45, 20], [14, 20], [15, 20], [14, 19], [15, 19], [22, 17]]
train_y = [ False,    True,    True,     False,    True,     False,    False,    False]

# Step 3. Train ML algo (make your computer figure out the decision rules behind the training data)
clf = tree.DecisionTreeClassifier()
clf.fit(train_x, train_y)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [4]:
#Step 4
print(clf.predict([[2, 2], [16, 17], [15, 20], [1, 32]]))

[ True False  True  True]
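
To see which rules the tree actually learned from the eight examples, scikit-learn can print a fitted tree as text. A minimal sketch, assuming scikit-learn 0.21 or later (where `tree.export_text` is available); the feature names `a` and `b` and the fixed `random_state` are choices made here for illustration:

```python
from sklearn import tree

# Same toy training data as in the cell above
train_x = [[10, 34], [9, 4], [45, 20], [14, 20], [15, 20], [14, 19], [15, 19], [22, 17]]
train_y = [False, True, True, False, True, False, False, False]

# random_state is fixed only to make this sketch repeatable
clf = tree.DecisionTreeClassifier(random_state=0)
clf.fit(train_x, train_y)

# export_text renders the fitted tree as nested threshold rules
rules = tree.export_text(clf, feature_names=["a", "b"])
print(rules)
```

The learned thresholds come from the eight examples only, so they need not coincide with the hand-written cut-offs of 15 and 20.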



# Let's explore the process of machine learning through a fictional story!


![title](pic.jpg)

# Welcome to Arstotzka!
### Things aren't going well and the country's food supplies are running short. If citizens want to eat, they have to send a letter to the Department of Food Supplies. The officials judge their requests and either reject them or send a slice of pizza.

### You're the department's software engineer, and your boss told you to automate request approval with this new thing called machine learning... Your boss gave you the past decisions in the file `data.csv`.

# Step 1 - Get data

In [5]:
import pandas as pd
df = pd.read_csv('data.csv')
df.head()



| Unnamed: 0.2 | Unnamed: 0 | Unnamed: 0.1 | request_id | request_title | request_text_edit_aware | requester_received_pizza | requester_interests | account_age | requester_city |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | t3_l25d7 | Request Colorado Springs Help Us Please | Hi I am in need of food for my 4 children we a... | False | [] | 792.420405 | Nirsk |
| 1 | 1 | 1 | t3_rcb83 | [Request] California, No cash and I could use ... | I spent the last money I had on gas today. Im ... | False | ['AskReddit', 'Eve', 'IAmA', 'MontereyBay', 'R... | 1122.279838 | Vescillo |
| 2 | 2 | 2 | t3_lpu5j | [Request] Hungry couple in Dundee, Scotland wo... | My girlfriend decided it would be a good idea ... | False | [] | 771.616181 | Nirsk |
| 3 | 3 | 3 | t3_mxvj3 | [Request] In Canada (Ontario), just got home f... | It's cold, I'n hungry, and to be completely ho... | False | ['AskReddit', 'DJs', 'IAmA', 'Random_Acts_Of_P... | 741.035602 | Vescillo |
| 4 | 4 | 4 | t3_1i6486 | [Request] Old friend coming to visit. Would LO... | hey guys:\n I love this sub. I think it's grea... | False | ['BrosWeightLoss', 'RandomActsOfCookies', 'Ran... | 308.633819 | Nirsk |


In [6]:
df = df.drop(['Unnamed: 0','Unnamed: 0.1'], axis=1)
i = 1017
print('UID:\t\t', df['request_id'][i], '\n')
print('Title:\t\t', df['request_title'][i], '\n')
print('Text:\t\t', df['request_text_edit_aware'][i], '\n')
print('Received:\t', df['requester_received_pizza'][i], '\n')
print('Requester interests:\t', df['requester_interests'][i], '\n')
print('Account age:\t', df['account_age'][i], '\n')
print('Requester city:\t', df['requester_city'][i], '\n')


UID:		 t3_1fhc8g 

Title:		 [Request] Broke student, been living off pasta for a good few weeks. A pizza may save my stomach's sanity! (UK) 

Text:		 As lovely as cheesy pasta is, it gets a bit boring after a while! I'm heading back to work over summer in a few weeks, so I can pay it forward when I get my first paycheck through :) 

Received:	 False 

Requester interests:	 ['AdviceAnimals', 'Antihumor', 'AskReddit', 'CornyJoke', 'FanTheories', 'FestivalSluts', 'GifSound', 'Heavymind', 'Hipsterchicks', 'IAmA', 'InternetIsBeautiful', 'LucidDreaming', 'Minecraft', 'Music', 'Paranormal', 'ProjectReddit', 'Psychonaut', 'RedditDayOf', 'RepublicOfMusic', 'Shaboozey', 'ShittyFanTheories', 'Slender_Man', 'Southampton', 'SquaredCircle', 'StonerEngineering', 'Straightdan', 'Survive', 'UniversityofReddit', 'WTF', 'WWE', 'WeAreTheMusicMakers', 'Yogscast', 'askdrugs', 'askscience', 'circlebroke', 'circlejerk', 'civ', 'coestar', 'conspiracy', 'counting', 'creepy', 'cringe', 'dayz', 'dayzlfg', 'epheme

In [7]:
df["requester_received_pizza"].value_counts()

False    3045
True

In [8]:
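# Count requests with missing text, then replace all missing values with empty strings
print(df['request_text_edit_aware'].isna().value_counts())
df = df.fillna("")
# Drop the malformed row whose label column holds an interests list instead of True/False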
df = df[df["requester_received_pizza"] != "['2007scape', '3DS', 'AdviceAnimals', 'AnimalCrossing', 'AskReddit', 'Bowling', 'Eve', 'Foreversend', 'FryeMadden', 'FryeMadden2', 'HIMYM', 'IAmA', 'KarmaCourt', 'KarmaCourtAttorneys', 'KarmaCourtJury', 'KerbalSpaceProgram', 'Madden', 'Music', 'NBA2K13', 'PS4', 'Random_Acts_Of_Pizza', 'WTF', 'WhatsInThisThing', 'aviation', 'aww', 'cpp', 'fffffffuuuuuuuuuuuu', 'flightattendants', 'flying', 'funny', 'gamedev', 'gaming', 'gif', 'golf', 'halo', 'learnprogramming', 'loseit', 'mcservers', 'offmychest', 'pics', 'rebelmadden', 'roosterteeth', 'todayilearned', 'usu', 'videos', 'windowsphone', 'xboxone']"]

False    3937
True      104
Name: request_text_edit_aware, dtype: int64


Summary of fields:

**Input**:
 - `request_id`: unique identifier for the request
 - `request_title`: title of the Reddit post requesting pizza
 - `request_text_edit_aware`: body text of the pizza request
 - `requester_interests`: tags collected about the requester's interests
 - `account_age`: age of the requester's account
 - `requester_city`: city the requester is from

**Output**:
 - `requester_received_pizza`: whether the requester received a pizza

For our purposes, let's use the request text, interests, and city as features to predict whether a person should receive a pizza.

# Step 2 - Preprocess data

### Split training data before vectorization

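The first thing to do is split our labeled data into two parts, so that the vectorizers learn their vocabulary from the training part only:

 - **training**: used to train our model
 - **validation**: used to check the "soundness" of our model on examples it has not seen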

In [12]:
from sklearn.model_selection import train_test_split 
train, valid = train_test_split(df, test_size=0.2)
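
The split above is random, so the two parts can end up with different shares of approved requests, and the result changes on every run. A hedged sketch on made-up labels showing two optional arguments of `train_test_split`, `stratify` and `random_state`, that address this (the toy data and variable names are assumptions for illustration, not the pizza dataset):

```python
from sklearn.model_selection import train_test_split

# Made-up data: 20 requests, 5 approved (True) and 15 rejected (False)
toy_x = [[i] for i in range(20)]
toy_y = [True] * 5 + [False] * 15

# stratify keeps the True/False ratio the same in both parts;
# random_state makes the shuffle repeatable
x_tr, x_va, y_tr, y_va = train_test_split(
    toy_x, toy_y, test_size=0.2, stratify=toy_y, random_state=0
)

print(len(y_tr), sum(y_tr))  # size of the training part and its approved count
print(len(y_va), sum(y_va))  # size of the validation part and its approved count
```

With a DataFrame, the same idea would pass the label column to `stratify`.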

### Vectorize the train and validation set

In [13]:
from sklearn.feature_extraction.text import CountVectorizer
from scipy.sparse import coo_matrix, hstack

stopwords = ["its","itself","they","them","their","theirs","themselves","what","which","who","whom","this","that","these","those","am","is","are","was","were","be","been","being","have","has","had","having","do","does","did","doing","a","an","the","and","but","if","or","because","as","until","while","of","at","by","for","with","about","against","between","into","through","during","before","after","above","below","to","from","up","down","in","out","on","off","over","under","again","further","then","once","here","there","when","where","why","how","all","any","both","each","few","more","most","other","some","such","no","nor","not","only","own","same","so","than","too","very","s","t","can","will","just","don","should","now"]
#"i","me","my","myself","we","our","ours","ourselves","you","your","yours","yourself","yourselves","he","him","his","himself","she","her","hers","herself","it"
    
train_y = train['requester_received_pizza']
valid_y = valid['requester_received_pizza']    

count_vect_text = CountVectorizer(stop_words=stopwords, max_features=4000, min_df=6)
train_x = coo_matrix(count_vect_text.fit_transform(train['request_text_edit_aware']))
valid_x = coo_matrix(count_vect_text.transform(valid['request_text_edit_aware']))
print(train_x.shape)

count_vect_int = CountVectorizer(max_features=500, min_df=3)
train_x_int = coo_matrix(count_vect_int.fit_transform(train['requester_interests']))
valid_x_int = coo_matrix(count_vect_int.transform(valid['requester_interests']))
print(train_x_int.shape)

count_vect_city = CountVectorizer()
train_x_city = coo_matrix(count_vect_city.fit_transform(train['requester_city']))
valid_x_city = coo_matrix(count_vect_city.transform(valid['requester_city']))
print(train_x_city.shape)

train_x = hstack([train_x, train_x_int, train_x_city]).toarray()
valid_x = hstack([valid_x, valid_x_int, valid_x_city]).toarray()

(3232, 2205)
(3232, 500)
(3232, 2)
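
To make the vectorization step concrete, here is a small sketch on made-up sentences (not taken from `data.csv`). It shows that `fit_transform` learns the vocabulary from the training texts, while `transform` only counts words that are already in that vocabulary:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up training requests and one made-up validation request
train_docs = ["pizza please", "hungry student needs pizza"]
valid_docs = ["hungry for sushi"]

vect = CountVectorizer()
train_counts = vect.fit_transform(train_docs)  # learns the vocabulary, then counts
valid_counts = vect.transform(valid_docs)      # reuses the learned vocabulary

# vocabulary_ maps each learned word to its column index
print(sorted(vect.vocabulary_))
print(train_counts.toarray())
# "for" and "sushi" were never seen in training, so they are not counted
print(valid_counts.toarray())
```

Each row is one request and each column is one word, which is why the shapes printed above have one column per vocabulary entry.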


# Step 3 - Training model

### Let's first try a Multinomial Naive Bayes model

In [14]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB() 
clf.fit(train_x, train_y) 

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

### Let's get a sense of how good our classifier is on the validation set.

In [15]:
from sklearn.metrics import f1_score
predictions_valid = clf.predict(valid_x)
# Micro-averaged F1 over single-label predictions equals plain accuracy
print('Pizza giving accuracy: ', f1_score(valid_y, predictions_valid, average='micro'))

Pizza giving accuracy:  0.7289603960396039
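
The printed label says accuracy while the code calls `f1_score`. With `average='micro'` and one label per sample the two coincide, which the sketch below checks on made-up labels. It also computes the F1 score of the `True` class alone, which can tell a different story when rejections dominate (the toy labels are an assumption for illustration):

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels: 8 rejections followed by 2 approvals
y_true = [False] * 8 + [True] * 2
# Predictions: 7 correct rejections, 1 wrong approval, 1 correct approval, 1 missed approval
y_pred = [False] * 7 + [True] + [True] + [False]

acc = accuracy_score(y_true, y_pred)
micro_f1 = f1_score(y_true, y_pred, average='micro')
# F1 of the True class only (precision and recall of approvals)
true_f1 = f1_score(y_true, y_pred, pos_label=True)

print(acc, micro_f1, true_f1)
```

Since most requests in the data are rejections, it is worth comparing any score against what a model that always answers `False` would achieve.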


### Let's try another algorithm, logistic regression trained with SGD

In [17]:
#best params: alpha=0.01, loss='log', penalty='elasticnet', l1_ratio=0, tol=None, max_iter=30
#note: loss='log' fits logistic regression; loss='hinge' would give a linear SVM
#scikit-learn 1.1 and later spell this loss 'log_loss'
from sklearn.linear_model import SGDClassifier
clf_SVM = SGDClassifier(alpha=0.01, loss='log', penalty='elasticnet', l1_ratio=0, tol=None, max_iter=30)
clf_SVM.fit(train_x, train_y) 
predictions_valid = clf_SVM.predict(valid_x)
print('Pizza giving accuracy: ', f1_score(predictions_valid, valid_y, average='micro'))

Pizza giving accuracy:  0.7735148514851485


# Step 4 - Run predictions on new data

In [18]:
test = pd.read_csv('new_data.csv')
test = test.fillna("")
test_x = coo_matrix(count_vect_text.transform(test['request_text_edit_aware']))
test_x_int = coo_matrix(count_vect_int.transform(test['requester_interests']))
test_x_city = coo_matrix(count_vect_city.transform(test['requester_city']))
test_x = hstack([test_x, test_x_int, test_x_city]).toarray()

# clf is the Naive Bayes model; use clf_SVM here to predict with the SGD model instead
predictions = clf.predict(test_x)

**Note:** Since the test data has no `requester_received_pizza` field, we can't measure accuracy. But we can explore the predictions, as shown below.

In [19]:
pd.Series(predictions).value_counts()

False    1302
True      329
dtype: int64

# Step 5 - Assume something is wrong and find it 

In [20]:
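# Attach predictions to the test rows; the unnamed Series becomes column 0
pred_df = pd.concat([test, pd.Series(predictions)], axis=1)
# Labels are strings here, so compare against 'True' rather than the boolean True
received_pizza = pred_df[pred_df[0] == 'True']
# Predicted approvals per city, followed by total requests per city
print(received_pizza['requester_city'].value_counts())
print(pred_df['requester_city'].value_counts())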

Nirsk       261
Vescillo     68
Name: requester_city, dtype: int64
Nirsk       843
Vescillo    788
Name: requester_city, dtype: int64
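
Raw counts are hard to compare because the two cities sent different numbers of requests. A small sketch that turns the counts printed above into approval rates per city (the variable names are illustrative):

```python
# Counts copied from the output above
predicted_true = {"Nirsk": 261, "Vescillo": 68}
total_requests = {"Nirsk": 843, "Vescillo": 788}

# Share of each city's requests that the model approves
approval_rate = {
    city: predicted_true[city] / total_requests[city] for city in total_requests
}

for city, rate in approval_rate.items():
    print(f"{city}: {rate:.1%} of requests approved")
```

On these counts the model approves roughly 31% of requests from Nirsk and under 9% from Vescillo, which is the kind of gap this step is meant to surface.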
