
# Machine Learning Intro
Let's build a common understanding of machine learning by coding up the same decision tree in the traditional way and in the machine learning way.

---

## Traditional rule based systems / Software 1.0
The process is straightforward: the software engineer writes the rules the system uses to make decisions, then runs the program.

In [1]:
# Returns True when a < 15 and b < 20 are both true or both false, and False otherwise.
# Decision rules.
def make_decision(a, b):
    if (a < 15):
        if (b < 20):
            return True
        else:
            return False
    else:
        if (b < 20):
            return False
        else:
            return True 
        
# Show the decision the rules give for one example
make_decision(10, 34)

False
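
The nested branches above reduce to a single comparison: the function returns True exactly when `a < 15` and `b < 20` agree. A minimal sketch to check this, where `make_decision_compact` is a hypothetical helper name that is not part of the original notebook:

```python
def make_decision(a, b):
    # Same nested rules as in the cell above, repeated so the sketch is self-contained.
    if a < 15:
        if b < 20:
            return True
        else:
            return False
    else:
        if b < 20:
            return False
        else:
            return True

def make_decision_compact(a, b):
    # Hypothetical one-line form: True when both comparisons give the same answer.
    return (a < 15) == (b < 20)

# Compare the two versions on a small grid of values around both thresholds.
samples = [(a, b) for a in (10, 14, 15, 45) for b in (4, 19, 20, 34)]
all_match = all(make_decision(a, b) == make_decision_compact(a, b) for a, b in samples)
print(all_match)
```

Either way, the rules are written by hand; that is the part the machine learning version below replaces.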

## Machine learning systems / Software 2.0
The software engineer writes a program that figures out the decision rules from past examples of decisions. The process is more involved. Steps:
1. Collect past examples of decisions (training data)
2. Clean the data and transform it into the format your algorithm likes (binary classes, vectors, numbers)
3. Train, then test your algorithm and fine-tune it
4. Run predictions with your model

In [15]:
#import machine learning algorithm
from sklearn import tree

# Steps 1 and 2. Training data (past decision examples)
train_x = [[10, 34], [9, 4], [45, 20], [14, 20], [15, 20], [14, 19], [15, 19], [22, 17]]
train_y = [ False,    True,    True,     False,    True,     False,    False,    False]

# Step 3. Train ML algo (make your computer figure out the decision rules behind the training data)
clf = tree.DecisionTreeClassifier()
clf = clf.fit(train_x, train_y)

In [4]:
#Step 4
print(clf.predict([[2, 2], [16, 17], [15, 20], [1, 32]]))

[ True False  True  True]
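
To see which rules the tree actually learned, scikit-learn can print a fitted tree as text. A minimal sketch, assuming scikit-learn 0.21 or newer (where `sklearn.tree.export_text` is available); the `toy_` names, the feature names `a` and `b`, and the fixed `random_state` are additions for illustration, not part of the original cells:

```python
from sklearn import tree

# Same toy training data as in the cell above.
toy_x = [[10, 34], [9, 4], [45, 20], [14, 20], [15, 20], [14, 19], [15, 19], [22, 17]]
toy_y = [False, True, True, False, True, False, False, False]

# random_state is fixed only so that repeated runs build the same tree.
toy_clf = tree.DecisionTreeClassifier(random_state=0)
toy_clf.fit(toy_x, toy_y)

# Print the learned thresholds as nested if/else-style rules.
rules = tree.export_text(toy_clf, feature_names=["a", "b"])
print(rules)
```

The learned thresholds need not match the hand-written 15 and 20: the tree only looks for splits that separate the eight examples it was given.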






---
### That's a very simple example, so let's take one more step forward.

# Let's explore the process of machine learning through a fictional story!


![title](picture.jpg)

# Welcome to Arstotzka!
### Things aren't going well and the country's food supplies are running short. 
### If citizens want to eat, they have to send a letter to the Department of Food Supplies. The officials judge their requests and either reject them or send a slice of pizza.

### You're the department's software engineer and your boss told you to automate request approval with this new thing called machine learning... Your boss gave you the past decisions in the file data.csv.

---
# Step 1 - Get data

In [117]:
import pandas as pd
df = pd.read_csv('data.csv')
df.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,request_id,request_title,request_text_edit_aware,requester_received_pizza,requester_interests,account_age,requester_city
0,0,0,t3_l25d7,Request Colorado Springs Help Us Please,Hi I am in need of food for my 4 children we a...,False,[],792.420405,Nirsk
1,1,1,t3_rcb83,"[Request] California, No cash and I could use ...",I spent the last money I had on gas today. Im ...,False,"['AskReddit', 'Eve', 'IAmA', 'MontereyBay', 'R...",1122.279838,Vescillo
2,2,2,t3_lpu5j,"[Request] Hungry couple in Dundee, Scotland wo...",My girlfriend decided it would be a good idea ...,False,[],771.616181,Nirsk
3,3,3,t3_mxvj3,"[Request] In Canada (Ontario), just got home f...","It's cold, I'n hungry, and to be completely ho...",False,"['AskReddit', 'DJs', 'IAmA', 'Random_Acts_Of_P...",741.035602,Vescillo
4,4,4,t3_1i6486,[Request] Old friend coming to visit. Would LO...,hey guys:\n I love this sub. I think it's grea...,False,"['BrosWeightLoss', 'RandomActsOfCookies', 'Ran...",308.633819,Nirsk



### As you can see, the `Unnamed` columns just duplicate the row index, so let's drop them.

In [118]:
df = df.drop(['Unnamed: 0','Unnamed: 0.1'], axis=1)

### The table is too big to see. Let's pretty print the 1018th row!

In [119]:
i = 1017
print('UID:\n', df['request_id'][i], '\n')
print('Title:\n', df['request_title'][i], '\n')
print('Text:\n', df['request_text_edit_aware'][i], '\n')
print('Received:\n', df['requester_received_pizza'][i], '\n')
print('Requester interests:\n', df['requester_interests'][i].replace("'", ""), '\n')
print('Account age:\t', df['account_age'][i], '\n')
print('Requester city:\t', df['requester_city'][i], '\n')

UID:
 t3_1fhc8g 

Title:
 [Request] Broke student, been living off pasta for a good few weeks. A pizza may save my stomach's sanity! (UK) 

Text:
 As lovely as cheesy pasta is, it gets a bit boring after a while! I'm heading back to work over summer in a few weeks, so I can pay it forward when I get my first paycheck through :) 

Received:
 False 

Requester interests:
 [AdviceAnimals, Antihumor, AskReddit, CornyJoke, FanTheories, FestivalSluts, GifSound, Heavymind, Hipsterchicks, IAmA, InternetIsBeautiful, LucidDreaming, Minecraft, Music, Paranormal, ProjectReddit, Psychonaut, RedditDayOf, RepublicOfMusic, Shaboozey, ShittyFanTheories, Slender_Man, Southampton, SquaredCircle, StonerEngineering, Straightdan, Survive, UniversityofReddit, WTF, WWE, WeAreTheMusicMakers, Yogscast, askdrugs, askscience, circlebroke, circlejerk, civ, coestar, conspiracy, counting, creepy, cringe, dayz, dayzlfg, ephemera, explainlikeimfive, facepalm, fffffffuuuuuuuuuuuu, fifthworldproblems, footballmanagergam

### Let's investigate how many people actually received pizzas in the past.

In [120]:
df["requester_received_pizza"].value_counts()

False                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        3045
True                                                                                                                                                                                                                                                                                                                                                                                                  


### Oops, there must be a mistake in the data, which reminds me that we should replace N/A values with empty strings, and we should only work with rows that have True/False values in their "requester_received_pizza" column

In [121]:
df = df.fillna("")
df = df[df["requester_received_pizza"].isin(['True','False'])]

#Let's try again
df["requester_received_pizza"].value_counts()

False    3045
True      994
Name: requester_received_pizza, dtype: int64

### Summary of fields:

**Input data**:
 - `request_id`: unique identifier for the request 
 - `request_title`: title of the Reddit post requesting pizza
 - `request_text_edit_aware`: body text of the pizza request
 - `requester_interests`: collected tags describing the requester's interests
 - `account_age`: how old the account is
 - `requester_city`: city the requester is from
 
**Output decision made**:
 - `requester_received_pizza`: whether the requester gets their pizza
 
For our purpose let's choose the request text, interests and city as features to predict whether a person should receive a pizza.

---
# Step 2 - Preprocess data

### Split training data before vectorization

The first thing to do is to split our data into two parts:

 - **training**: used for training our model
 - **validation**: used to check the "soundness" of our model by running predictions on this data and counting how many the model gets right.

In [122]:
from sklearn.model_selection import train_test_split 
train, valid = train_test_split(df, test_size=0.2)
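
Because only about a quarter of the past requests were approved (994 of 4039), a purely random split can leave the two parts with slightly different approval rates, and it changes on every run. A hedged sketch of a repeatable, stratified split on a small made-up frame; `toy_df`, the `stratify` argument and `random_state=42` are illustrative choices, not something the original notebook uses:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in for the pizza data: 15 rejections and 5 approvals, labels stored as strings.
toy_df = pd.DataFrame({
    "request_text_edit_aware": [f"request number {n}" for n in range(20)],
    "requester_received_pizza": ["False"] * 15 + ["True"] * 5,
})

# stratify keeps the True/False ratio the same in both parts;
# random_state makes the split repeatable between runs.
toy_train, toy_valid = train_test_split(
    toy_df,
    test_size=0.2,
    stratify=toy_df["requester_received_pizza"],
    random_state=42,
)

print(toy_train["requester_received_pizza"].value_counts())
print(toy_valid["requester_received_pizza"].value_counts())
```

With a fixed `random_state`, the shapes and scores printed later in a notebook stay the same from run to run, which makes results easier to compare.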

### Data formats and Countvectorizing
Machine learning algorithms in most cases don't know what to do with characters and strings, so we'll need to transform our strings. One way to do so is countvectorization. But what is that? Here's an example:

In [123]:
from sklearn.feature_extraction.text import CountVectorizer
from scipy.sparse import coo_matrix, hstack
# create countvectorizer object
count_vect_text = CountVectorizer()

# the count vectorizer lowercases the text and counts how often each unique word occurs.
myString = ["She sells sea shells. The shells she sells are surely sea shells."]

# the output is in the format:   (<document index>, <word index in the alphabetically sorted vocabulary>)  <number of times it occurs>
print(count_vect_text.fit_transform(myString))

# the vocabulary here is: are=0, sea=1, sells=2, she=3, shells=4, surely=5, the=6
# therefore from the result we see that for instance:
#  - the words with indices 5, 0 and 6 occur once: "surely", "are", "the"
#  - the word with index 4 occurs 3 times: "shells"

  (0, 5)	1
  (0, 0)	1
  (0, 6)	1
  (0, 4)	3
  (0, 1)	2
  (0, 2)	2
  (0, 3)	2
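
The index pairs are easier to read once each column index is mapped back to its word. A minimal sketch using the fitted vectorizer's `vocabulary_` attribute; the names `demo_vect`, `index_to_word` and `word_counts` are illustrative and not part of the original notebook:

```python
from sklearn.feature_extraction.text import CountVectorizer

demo_vect = CountVectorizer()
demo_counts = demo_vect.fit_transform(
    ["She sells sea shells. The shells she sells are surely sea shells."]
)

# vocabulary_ maps each lowercased word to its column index (assigned in alphabetical order).
index_to_word = {index: word for word, index in demo_vect.vocabulary_.items()}

# Pair every column of the single document with its word and count.
word_counts = {index_to_word[i]: int(c) for i, c in enumerate(demo_counts.toarray()[0])}
print(word_counts)
```

Column 4 is "shells" with a count of 3, which matches the `(0, 4)  3` entry in the output above.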


### Let's countvectorize our whole training and validation sets

In [124]:
from sklearn.feature_extraction.text import CountVectorizer
from scipy.sparse import coo_matrix, hstack

stopwords = ["its","itself","they","them","their","theirs","themselves","what","which","who","whom","this","that","these","those","am","is","are","was","were","be","been","being","have","has","had","having","do","does","did","doing","a","an","the","and","but","if","or","because","as","until","while","of","at","by","for","with","about","against","between","into","through","during","before","after","above","below","to","from","up","down","in","out","on","off","over","under","again","further","then","once","here","there","when","where","why","how","all","any","both","each","few","more","most","other","some","such","no","nor","not","only","own","same","so","than","too","very","s","t","can","will","just","don","should","now"]
#"i","me","my","myself","we","our","ours","ourselves","you","your","yours","yourself","yourselves","he","him","his","himself","she","her","hers","herself","it"
    
train_y = train['requester_received_pizza']
valid_y = valid['requester_received_pizza']    

count_vect_text = CountVectorizer(stop_words=stopwords, max_features=4000, min_df=6)
train_x = coo_matrix(count_vect_text.fit_transform(train['request_text_edit_aware']))
valid_x = coo_matrix(count_vect_text.transform(valid['request_text_edit_aware']))
print(train_x.shape)

count_vect_int = CountVectorizer(max_features=500, min_df=3)
train_x_int = coo_matrix(count_vect_int.fit_transform(train['requester_interests']))
valid_x_int = coo_matrix(count_vect_int.transform(valid['requester_interests']))
print(train_x_int.shape)

count_vect_city = CountVectorizer()
train_x_city = coo_matrix(count_vect_city.fit_transform(train['requester_city']))
valid_x_city = coo_matrix(count_vect_city.transform(valid['requester_city']))
print(train_x_city.shape)

train_x = hstack([train_x, train_x_int, train_x_city]).toarray()
valid_x = hstack([valid_x, valid_x_int, valid_x_city]).toarray()

(3231, 2230)
(3231, 500)
(3231, 2)
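
Note the pattern in the cell above: each vectorizer is fitted on the training part only and then reused on the validation part. A small sketch of why that matters, on made-up sentences (all `toy_` names are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_train_text = ["please send pizza", "hungry student needs pizza"]
toy_valid_text = ["hungry family needs dinner"]

toy_vect = CountVectorizer()
# fit_transform learns the vocabulary from the training text only.
toy_train_counts = toy_vect.fit_transform(toy_train_text)
# transform reuses that vocabulary, so both matrices have the same columns;
# words never seen in training ("family", "dinner") are silently ignored.
toy_valid_counts = toy_vect.transform(toy_valid_text)

print(sorted(toy_vect.vocabulary_))
print(toy_train_counts.shape, toy_valid_counts.shape)
print(toy_valid_counts.toarray())
```

Fitting a second vectorizer on the validation text would produce different columns, and the trained model could not make sense of them.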


---
# Step 3 - Training model
Finally, we've arrived at the part everyone was waiting for! In this step, we'll test a few different classifiers to see how they perform.

### Let's try first a Multinomial Naive Bayes model

In [125]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB() 
clf.fit(train_x, train_y) 

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

### Let's get a sense of how good our classifier is on the validation set.

In [126]:
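from sklearn.metrics import f1_score
predictions_valid = clf.predict(valid_x)
# For single-label classification, micro-averaged F1 is the same number as plain accuracy.
# scikit-learn metrics take the true labels first, then the predictions.
print('Pizza giving accuracy: ', f1_score(valid_y, predictions_valid, average='micro'))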

Pizza giving accuracy:  0.7314356435643565
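
A score of 0.73 is easier to judge against a baseline. About 75% of the cleaned past requests were rejected (3045 of 4039), so a model that always answers False would already be right roughly three times out of four on a similar split. A minimal sketch of that comparison using scikit-learn's `DummyClassifier` on made-up labels with the same 3:1 imbalance; the toy arrays are assumptions for illustration only:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up labels with roughly the same imbalance as the pizza data: 75 rejections, 25 approvals.
toy_y = np.array(["False"] * 75 + ["True"] * 25)
# The features are irrelevant to this baseline, so a column of zeros is enough.
toy_x = np.zeros((len(toy_y), 1))

# strategy="most_frequent" always predicts the majority class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(toy_x, toy_y)

baseline_accuracy = accuracy_score(toy_y, baseline.predict(toy_x))
print("Always-reject baseline accuracy:", baseline_accuracy)
```

On imbalanced data like this, it is worth checking a classifier's validation score against such a baseline before reading too much into it.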


### Let's try another algorithm, Support Vector Machines

In [127]:
#best params: alpha=0.01, loss='log', penalty='elasticnet', l1_ratio=0, tol=None, max_iter=30
from sklearn.linear_model import SGDClassifier
clf_SVM = SGDClassifier(max_iter=5, tol=None) 
clf_SVM.fit(train_x, train_y) 
predictions_valid = clf_SVM.predict(valid_x)
print('Pizza giving accuracy: ', f1_score(valid_y, predictions_valid, average='micro'))

Pizza giving accuracy:  0.7277227722772277


### The SVM scores slightly lower than Naive Bayes here (0.728 vs. 0.731), but it often helps to tweak some parameters of the algorithm. I played around to find the best combination:

In [128]:
from sklearn.linear_model import SGDClassifier
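# Note: loss='log' makes SGDClassifier fit logistic regression rather than a linear SVM;
# newer scikit-learn releases name this loss 'log_loss'.
clf_SVM = SGDClassifier(alpha=0.01, loss='log', penalty='elasticnet', l1_ratio=0, tol=None, max_iter=30) 
clf_SVM.fit(train_x, train_y) 
predictions_valid = clf_SVM.predict(valid_x)
print('Pizza giving accuracy: ', f1_score(valid_y, predictions_valid, average='micro'))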

Pizza giving accuracy:  0.7735148514851485
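
Trying combinations by hand works, but the search can also be automated. A hedged sketch using scikit-learn's `GridSearchCV` on synthetic data; the grid values, the 3-fold setting and the toy dataset are assumptions for illustration, not the search the author actually ran:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the vectorized pizza features.
toy_x, toy_y = make_classification(n_samples=200, n_features=20, random_state=0)

# Hypothetical grid: every combination of these values is trained and scored.
param_grid = {
    "alpha": [0.0001, 0.001, 0.01],
    "penalty": ["l2", "elasticnet"],
}

search = GridSearchCV(
    SGDClassifier(max_iter=30, tol=None, random_state=0),
    param_grid,
    scoring="accuracy",
    cv=3,  # 3-fold cross-validation on the training data
)
search.fit(toy_x, toy_y)

print(search.best_params_)
print(search.best_score_)
```

Because the grid search scores each combination with cross-validation on the training data, the validation set stays untouched for a final check.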


---
# Step 4 - Run predictions on new data
Now we take the data that we have to make decisions on. This dataset has all the features and the algorithm will decide whether these people will get pizzas or not based on the rules it learnt from past decisions.

In [129]:
test = pd.read_csv('new_data.csv')
test = test.fillna("")
test_x = coo_matrix(count_vect_text.transform(test['request_text_edit_aware']))
test_x_int = coo_matrix(count_vect_int.transform(test['requester_interests']))
test_x_city = coo_matrix(count_vect_city.transform(test['requester_city']))
test_x = hstack([test_x, test_x_int, test_x_city]).toarray()

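# Note: clf is still the Multinomial Naive Bayes model from Step 3;
# use clf_SVM here to predict with the tuned classifier instead.
predictions = clf.predict(test_x)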

**Note:** Since we don't have the `requester_received_pizza` field in test data, we can't measure accuracy. But we can do some exploration as shown below.

In [130]:
pd.Series(predictions).value_counts()

False    1312
True      319
dtype: int64

---
# Very niiice!
Out of the 1631 new requests, our algorithm rejected 1312 and approved 319. That is an approval rate of about 20%, in the same range as the roughly 25% approved in the original manual decisions.

**Now the government of Arstotzka can fire all its employees, and only employ this one algorithm!**

---
# Step 4+1 - Assume something is wrong and find it 

In [131]:
# let's concat the predictions and features into one table, and select the rows that received a pizza
pred_df = pd.concat([test, pd.Series(predictions)], axis=1)
received_pizza = pred_df[pred_df[0] == 'True']

# as a first guess, let's look at which cities the pizza receivers are from
print("Cities of pizza receivers: \n",received_pizza['requester_city'].value_counts())

# let's look at how many people are from those cities in general
print("\nTotal requests from each city: \n",pred_df['requester_city'].value_counts())

Cities of pizza receivers: 
 Nirsk       253
Vescillo     66
Name: requester_city, dtype: int64

Total requests from each city: 
 Nirsk       843
Vescillo    788
Name: requester_city, dtype: int64


# Do you see something wrong?

Nirsk and Vescillo sent roughly the same number of requests (843 and 788), but for some reason Nirsk received about 3.8 times as many pizzas (253 versus 66)!! We'll get back to this below.
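
Raw counts depend on how many requests each city sent, so a fairer comparison is the share of requests approved per city. A minimal sketch that recomputes this from the counts printed above (the variable names are illustrative):

```python
import pandas as pd

# Counts copied from the two value_counts() outputs above.
approved = pd.Series({"Nirsk": 253, "Vescillo": 66})
requests = pd.Series({"Nirsk": 843, "Vescillo": 788})

# Share of each city's requests that the model approved.
approval_rate = approved / requests
print(approval_rate.round(3))
```

Roughly 30% of requests from Nirsk were approved against roughly 8% from Vescillo, so the gap is not explained by one city simply sending more requests.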

### This was a lot, right? Let's recap what we just did!
1. got a short intro to machine learning
2. moved on to a bigger project
3. loaded the pizza data
4. split it into training and validation sets
5. vectorized the text
6. tested different models (Naive Bayes, SVM)
7. saw that the results are biased towards one city

It is crucial in machine learning to understand the biases in the training dataset, because the algorithms will only learn and reinforce those. 

[A real-life example](https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing) is a risk-assessment algorithm used in US courts to estimate how likely a defendant is to reoffend. ProPublica found that it wrongly labeled black defendants as high risk far more often than white defendants, so the algorithm ended up treating black people more harshly.

In our fictional Arstotzka example, the people making the original decisions might have been from Nirsk and sympathized more with requesters from there. However, this is not the behaviour we want our algorithm to have.

The same kind of bias in real-world datasets has serious consequences when models trained on them are applied to university admissions, insurance, loans and crime prediction. Engineers have the responsibility to find the biases and fix them. [These researchers showed how to reduce gender bias in the Google Word2Vec model](https://arxiv.org/pdf/1607.06520.pdf).
