# Train-Test Split  <span style="color:red"> TASK </span>

First, let's import pandas so we can handle tables effectively.

In [None]:
import pandas as pd

Golfer John likes to play golf, but only in certain conditions.  Below is a table of data relating to the different conditions in which Golfer John did and did not play golf. 

In [None]:
data = pd.read_csv('GolferJohn.csv')

Below are the top five rows of this dataset. 

In [None]:
data.head()

Split the table above into X data (features) and y data (in our case, whether or not Golfer John played golf).

In [None]:
X = data[['Weather', 'Temperature', 'Humidity', 'Windy', 'Ice cream available']]        ## Features:  Information relating to whether Golfer John played or not
y = data[['Play golf?']]                                                                ## Target:  Whether or not Golfer John played golf

Import the train_test_split function from the appropriate package:

http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

In [None]:
from sklearn.model_selection import train_test_split

Using the train_test_split function, split your data into X_train, X_test, y_train and y_test datasets.

In [None]:
## Put your code in this cell to split the data into the appropriate training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y)


Try doing the same train-test split but this time specifying that the test data is 30% of the total data available. 

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

Try doing another train-test split, but this time explicitly specifying that the data is shuffled before being split (shuffling is the default, so this only makes it explicit).

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=True)

**What is the benefit of using the 'random_state' parameter?**

ANSWER
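
One way to explore this question is to run the same split twice with a fixed seed and compare the results. The sketch below uses made-up toy arrays (`X_demo` and `y_demo` are hypothetical, not the golf table), and the seed value 42 is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for a real table (hypothetical values)
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1] * 5)

# Two separate calls with the same random_state
a_train, a_test, _, _ = train_test_split(X_demo, y_demo, random_state=42)
b_train, b_test, _, _ = train_test_split(X_demo, y_demo, random_state=42)

# Check whether both calls selected exactly the same test rows
same_split = np.array_equal(a_test, b_test)
print(same_split)
```

Try removing `random_state` from both calls and re-running the cell a few times to see how the comparison changes.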

**EXTENSION TASK:**  What does the train-test split's 'stratify' condition do?

ANSWER
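
A small experiment can help with this question. The sketch below builds a hypothetical imbalanced target (80 zeros and 20 ones, invented for illustration) and passes it to `stratify`, then looks at the share of ones that ends up in the test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 80 zeros and 20 ones
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 80 + [1] * 20)

# stratify=y_demo asks for the class proportions to be preserved in each split
_, _, y_train_strat, y_test_strat = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

# Proportion of ones in the full data and in the test split
print(y_demo.mean(), y_test_strat.mean())
```

Repeating the split without `stratify` and with different seeds shows how much the proportion can drift on small datasets.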

Theoretically, we can now use this information to develop a Machine Learning model to predict whether or not Golfer John played in certain weather conditions.

# Decision Tree Classifier <span style="color:red"> TASK </span>

The data below is for passengers on the Titanic.  We can use Machine Learning to train a model that predicts whether or not a passenger would have survived.

In [None]:
from IPython.display import Image
Image(url= "https://titanichistoricalsociety.org/wp-content/uploads/2017/09/titanic_historical_society_homepage_harley_crossley.jpg", width=1000, height=1000)

The data below is a subset of the full data available on Kaggle:  https://www.kaggle.com/c/titanic#description

In [None]:
df = pd.read_csv('titanic_data.csv')
df = df.dropna()
df

Below is a brief description of each of the features.

| **Feature** |  **Description**   |
|:-----------|------------:|
| **Survived**       |        This tells us whether the person survived or not.  |
|**Pclass** | This is the class of the passenger's ticket. The possible classes are 1, 2 or 3.  |
|**Sex** | This is the gender of the passenger.  0 indicates female, 1 indicates male. |
|**Age** | This is the age of the passenger in years. |
|**SibSp** | This indicates the number of siblings or spouses (brother, sister, husband, wife) that the passenger had on board with them. |
|**Fare** | This is how much the passenger paid for their ticket. |
|**Parch** | This is the number of parents or children (mother, father, son, daughter) that the passenger had on board with them.|


Split the data into <span style="color:red"> **features** </span> and <span style="color:red"> **target information**</span>.

In [None]:
X = df[['Pclass','Age','SibSp','Parch', 'Sex']]
y = df[['Survived']]

Split the data into training and testing data by filling in the blanks.

In [None]:
X_train , X_test , y_train , y_test = train_test_split(_______ , _______, random_state=2, test_size=0.3)

We will need to import the model from sklearn.  Googling sklearn Decision Tree gives you this link:

http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

We can then import the Decision Tree Classifier from this package.

In [None]:
from sklearn.tree import DecisionTreeClassifier

To build the model we start by calling a Decision Tree classifier.  Then we fit the model to the X_train and y_train data.  

In [None]:
clf = DecisionTreeClassifier()
clf.fit(______ , _______)

Once you have trained the model, try making predictions using the features from the test data.  Store this as y_pred. 

In [None]:
y_pred = clf.predict(___input_the_features_from_your_test_data_here___)

Measure the accuracy of your classifier using sklearn's **accuracy_score** function.  It takes the true outcomes as the first argument and the predictions as the second argument.

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
accuracy = accuracy_score(y_test , y_pred)
print(accuracy)

**EXTENSION TASK:** How does our model compare to someone who just predicted that everyone dies or everyone survives?  (Try counting how many people survived/died in the test data)

ANSWER
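
As a starting point for this comparison, the sketch below scores a "model" that always predicts the most common class. The labels in `y_true` are made up for illustration, and `DummyClassifier` is just one convenient way to build such a baseline; counting the values by hand works equally well.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical test labels: 1 = survived, 0 = died
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0, 0, 1])

# DummyClassifier ignores the features, so a placeholder column is enough
placeholder_X = np.zeros((len(y_true), 1))

# A baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(placeholder_X, y_true)
y_base = baseline.predict(placeholder_X)

baseline_accuracy = accuracy_score(y_true, y_base)
print(baseline_accuracy)
```

A trained classifier is only useful if its accuracy is clearly above this kind of baseline.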

# Random Forest Classifier <span style="color:red"> TASK </span>

Try using the same data from the Titanic example in the Decision Tree Classifier section but this time, try using the Random Forest Classifier.

In [None]:
df = pd.read_csv('titanic_data.csv')
df.head()

Split the data into features and targets (X and y data).

Split this into training and testing data.

Import the Random Forest Classifier from the appropriate package:

http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

In [None]:
from _____ import ______

Instantiate a model and then train it by using the .fit() method

Make predictions using your model

Test the accuracy of your model
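
The four steps above follow the same pattern as the Decision Tree section. The sketch below runs that pattern on synthetic data from `make_classification` rather than the Titanic table, so the variable names, the sample size and `n_estimators=100` are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a table with 5 numeric features and a binary target
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Split into training and testing data
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=2
)

# Instantiate the model, then train it with .fit()
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_tr, y_tr)

# Predict on the held-out features and score against the true labels
forest_pred = forest.predict(X_te)
forest_accuracy = accuracy_score(y_te, forest_pred)
print(forest_accuracy)
```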

# One-hot encoding <span style='color:red'> TASK </span>

In [None]:
df = pd.read_csv('CountryInfo.csv')

In [None]:
df

Try one-hot encoding this information to produce another table

What happens to the 'Country' column?  What happens to the 'Revenue' column?
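
If you are unsure where to start, `pd.get_dummies` is one common way to one-hot encode a column. The sketch below applies it to a tiny made-up table (`toy` and its values are hypothetical, not the contents of CountryInfo.csv).

```python
import pandas as pd

# Hypothetical table with one categorical and one numeric column
toy = pd.DataFrame(
    {"Country": ["France", "Spain", "France"], "Revenue": [100, 250, 175]}
)

# One-hot encode only the categorical column
encoded = pd.get_dummies(toy, columns=["Country"])

# Inspect the resulting column names
print(encoded.columns.tolist())
```

Compare the column list before and after encoding to answer the questions above for your own table.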

# Poor Quality Data Task  <span style='color:red'> TASK </span>

The code below imports all tables on the Wikipedia page about the Global Peace Index

In [None]:
url = 'https://en.wikipedia.org/wiki/Global_Peace_Index#Global_Peace_Index_rankings'
dfs = pd.read_html(url)

We are interested in the second table on this web page, which is referenced with index [1] (because Python counts from zero)

In [None]:
df = dfs[1]

We need to tidy this table.

*Note:*  The steps below will remove unnecessary columns (ranking columns) and will re-assign the 'index' and the 'header'.

In [None]:
columns_to_remove = [i for i in df.loc[0] if 'rank' in i]        # For the purpose of this task we do not want column information about 'rank'
new_columns = [i.split('[')[0] for i in df.loc[0] if 'score' in i]       # We do not want the additional references that appear in each column
df.columns = df.loc[0]       # Re-assign the columns using the information in the first row
df = df.drop(0)        # Get rid of the row that has just been 'copied' into the header 
df = df.drop(columns=columns_to_remove)        # Remove the columns identified in 'columns_to_remove'
df = df.set_index('Country')        # Set the 'Country' column as the index
df.columns = new_columns       # Re-assign the column titles without the additional references (listed in the 'new_columns' list)

Our table has lots of entries

In [None]:
print(len(df))

Below are the top 5 entries

In [None]:
df.head()

The line of code below shows us all rows in the dataframe where any entry is missing

In [None]:
df[pd.isnull(df).any(axis=1)]

We can either drop these rows using the dropna method:

In [None]:
drop_nulls = df.dropna()

This leaves fewer entries in our table

In [None]:
print(len(drop_nulls))

Alternatively, we can replace the Nulls with a value of our choosing.

In [None]:
replace_nulls = df.fillna(0)

The line of code below shows all the rows that contain a zero

In [None]:
replace_nulls[(replace_nulls.T == 0).any()]

Below is another table with information on the average and highest attendances at each of the World Cups from 1930 to date.    

In [None]:
url = 'https://en.wikipedia.org/wiki/FIFA_World_Cup'        ## This is the webpage we are importing the table from
df = pd.read_html(url, header=0)[2]          ## We are importing the third table 
df = df.drop([0,22])            ## Remove two unhelpful rows
df = df.set_index('Year')          ## Set the 'Year' column as the index 
df['Highest attendances †'] = df['Highest attendances †'].apply(lambda x: float(x.split('[')[0].replace(',','')))       ##  Remove the square bracket reference
df.columns = ['Hosts', 'Venues/Cities', 'Totalattendance', 'Matches',         
       'Tournament Avg. attendance', 'Highest attendance', 'Stadium of Highest Attendance', 'Score/Match']        ## Change the column names
df['Score/Match'] = df['Score/Match'].apply(lambda x: x.replace(', ',' (')+')')         ## Tidy one of the columns 
df.loc['1934', 'Tournament Avg. attendance'] = 100000000           ## Create one 'bogus' entry in the table (DataFrame.set_value was removed in pandas 1.0)
df.loc['2006', 'Highest attendance'] = None          ## Create one Null entry in the table  

Try removing any rows with missing data.  Try replacing any missing data with a value of your choice.

In [None]:
drop_nulls = _________

In [None]:
replace_nulls = _________

**QUESTION** 
Are there any records that surprise you in 'Tournament Avg. attendance'?  Which record surprises you and why?  What could you do with this record?

ANSWER
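
One possible way to spot a surprising record is to compare each value against the median of its column. The sketch below uses invented attendance figures, and the "more than 10 times the median" threshold is an arbitrary assumption rather than an established rule.

```python
import pandas as pd

# Hypothetical average attendances keyed by tournament year (made-up values)
attendance = pd.Series(
    [23000.0, 100000000.0, 21000.0, 47000.0, 29000.0],
    index=["1930", "1934", "1938", "1950", "1954"],
)

# Assumed rule of thumb: flag anything more than 10x the median
median = attendance.median()
is_suspicious = attendance > 10 * median
suspicious = attendance[is_suspicious]
print(suspicious)

# One option: treat flagged values as missing so dropna/fillna can handle them
cleaned = attendance.mask(is_suspicious)
```

Whether to drop, replace or correct a flagged record depends on what you know about the data source.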

# SVM  <span style='color:red'> TASK </span>

Below is a series of ingredients lists for Cupcakes, Muffins and Scones.

In [None]:
data = pd.read_csv('baking.csv', sep=',')

In [None]:
data = data[data['Type'].isin(['Muffin','Cupcake'])]

Based on the ingredients we hope to be able to predict whether a recipe is for Cupcakes or Muffins.

Create your X and y data, then split this into training and testing data.

In [None]:
X = data.drop(columns=['Type'])

In [None]:
y = data['Type'].tolist()

In [None]:
y_example = data['Type']

You will need to convert your 'y' data into 1s and 0s.  Firstly, convert your y data into a list.  Then try converting each element in your list using a dictionary.

In [None]:
conversion_dictionary = {'Cupcake': 1,
                         'Muffin': 0}

In [None]:
y = [conversion_dictionary[label] for label in y]

In [None]:
y

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y)

Import the SVC class from the appropriate package:

http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

In [None]:
from _____ import ______

Instantiate your model, then train your model.

Make predictions using your model and save them as y_pred

Test the accuracy of your model

How well did your model perform?  Do you have any concerns about the model you have built?  How could you improve your model?
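
When thinking about improvements, feature scaling is one thing worth trying, because SVMs are sensitive to the scale of each feature. The sketch below is a minimal example on synthetic data (not the baking table); the linear kernel, the `StandardScaler` step and all variable names are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for a table of numeric ingredient amounts
X_demo, y_demo = make_classification(n_samples=100, n_features=4, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)

# Scale the features, then fit a linear-kernel SVM
svm_model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
svm_model.fit(X_tr, y_tr)

# Predict on the test features and compute the accuracy
svm_pred = svm_model.predict(X_te)
svm_accuracy = svm_model.score(X_te, y_te)
print(svm_accuracy)
```

With a very small dataset, the accuracy can change a lot between splits, so it is worth repeating the experiment with several `random_state` values.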

# KNN  <span style='color:red'> TASK </span>

scikit-learn has some built-in datasets, such as the 'iris' dataset (a famous dataset about flowers).  You can import these datasets using the 'load' functions created for them.

In [None]:
from sklearn.datasets import load_iris

In [None]:
iris = load_iris()        # Load the dataset once and reuse it
df = pd.concat([pd.DataFrame(iris['data']), pd.DataFrame(iris['target'])], axis=1)
df.columns = iris['feature_names'] + ['species']

flower_dict = {0: 'setosa',
              1: 'versicolor',
              2: 'virginica'
              }

df['species'] = df['species'].apply(lambda x: flower_dict[x])

In [None]:
df.head()

![iris](iris.png "iris")

![sepal](sepal_petal.jpg "sepal_petal")

We can only visualize up to three dimensions at a time, so here are some 3D plots of the data.

In [None]:
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt

In [None]:
import itertools

In [None]:
set(df['species'])

In [None]:
color_map = {'setosa': 'g',
             'versicolor': 'r', 
             'virginica': 'b'}

The graphs below show three different features (out of the four available) plotted against one another

In [None]:
for combo in itertools.combinations(df.columns[:-1], 3):
    fig = plt.figure()
    ax = fig.add_subplot(111, projection='3d')
    for species in df['species'].unique():        # Loop over each species once, not over every row
        temp = df[df['species']==species]
        # Use the 3D axes so that the third argument is treated as the z-coordinate
        ax.scatter(temp[combo[0]], temp[combo[1]], temp[combo[2]], color=color_map[species], label=species)
    ax.set_xlabel(combo[0])
    ax.set_ylabel(combo[1])
    ax.set_zlabel(combo[2])
    ax.legend()
    plt.show()

We can use all four dimensions to classify the datapoints.  Try splitting your data into **features** and **targets** before splitting into **training** and **testing** data.  You can then build a KNeighbors model to predict each flower type.

In [None]:
from sklearn.neighbors import KNeighborsClassifier
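
For reference, the sketch below shows one way the whole KNN workflow could look. It reloads iris directly from scikit-learn instead of using the dataframe above, and `n_neighbors=5`, the seed and the stratified split are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Features are the four measurements; targets are the species codes 0, 1, 2
iris_data = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris_data.data, iris_data.target, random_state=0, stratify=iris_data.target
)

# Each prediction is a majority vote among the 5 nearest training points
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_tr, y_tr)

# Accuracy on the held-out flowers
knn_accuracy = knn.score(X_te, y_te)
print(knn_accuracy)
```

Try a few different values of `n_neighbors` to see how the accuracy responds.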

# Linear Regression  <span style='color:red'> TASK </span>

Below is a dataset of cars.  Each car has information about its features including the 'miles per gallon' that the car can achieve.

In [None]:
data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data-original', sep=r'\s+', header=None)        # sep=r'\s+' replaces delim_whitespace, which is deprecated in recent pandas
data.columns = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'model year', 'origin', 'car name']
data['origin'] = data['origin'].apply(lambda x: {1: 'American',
                                                 2: 'European',
                                                 3: 'Asian/Other'}[x])

In [None]:
data.head()

**QUESTION**

Can you build a linear regression model to predict the 'miles per gallon' given the features provided to you?  How will you measure the performance of your model?
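
As a hint for the second question, regression models are usually scored with error-based metrics rather than accuracy. The sketch below fits a linear regression to synthetic data (the weight values and the linear relation to mpg are invented for illustration, not taken from the car dataset) and reports two such metrics.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Hypothetical data: heavier cars get fewer miles per gallon, plus noise
rng = np.random.default_rng(0)
weight = rng.uniform(1500, 5000, size=200)
mpg = 45 - 0.007 * weight + rng.normal(0, 2, size=200)

X_demo = weight.reshape(-1, 1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, mpg, random_state=0)

# Fit the model and predict on the held-out cars
reg = LinearRegression()
reg.fit(X_tr, y_tr)
y_hat = reg.predict(X_te)

# R^2 measures explained variance; MAE is the typical error in mpg units
r2 = r2_score(y_te, y_hat)
mae = mean_absolute_error(y_te, y_hat)
print(r2, mae)
```

On the real dataset you would also need to handle missing values and the non-numeric columns before fitting.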

# Text Processing <span style='color:red'> TASK </span>

Below is a dataset featuring movie reviews from IMDB and their general sentiment.  'Polarity' represents either a good review (1) or a bad review (0).

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
import io
import pandas as pd
import requests
url = "https://raw.githubusercontent.com/SrinidhiRaghavan/AI-Sentiment-Analysis-on-IMDB-Dataset/master/imdb_tr.csv"
s = requests.get(url).text
df = pd.read_csv(io.StringIO(s))
df = df.set_index('row_Number')

In [None]:
df.head()

Try building a model that predicts whether a review is good or bad by converting the text into a vector, before passing it through a model that you think will be suitable for the task.

First, let's identify our X and y data, then create training and testing data

In [None]:
X = df['text']
y = df['polarity']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

Let's now create a vectorizer that transforms your text into a vector.  To improve the performance of your model you will need to tweak the parameters of your vectorizer.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
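cv = CountVectorizer(ngram_range=(1,1), 
                     max_features=1000,
                     stop_words='english',
                     max_df=1.0,        # A float is a proportion of documents; the integer 1 would keep only words found in a single document
                     min_df=1)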

Now we need to fit the vectorizer on the corpus of texts in our training data.  We can then transform our training and testing data into a series of vectors to form a matrix.

In [None]:
cv.fit(X_train)

Once we have taught the vectorizer all of the words that it needs to pay attention to (from the training data) we can transform each individual text into a vector

In [None]:
X_train_vec = cv.transform(X_train)
X_test_vec = cv.transform(X_test)
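
To see what fitting and transforming actually produce, the sketch below runs a default `CountVectorizer` on three made-up sentences (the corpus and variable names are hypothetical). It prints the learned vocabulary and the count vector for a new text that contains a word the vectorizer never saw.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Tiny hypothetical training corpus
train_docs = ["a great film", "a terrible film", "great acting"]

# Fitting learns the vocabulary from the training texts only
vec = CountVectorizer()
vec.fit(train_docs)
vocab = sorted(vec.vocabulary_)
print(vocab)

# Transforming counts how often each known word appears in a new text;
# words outside the vocabulary (here "plot") are ignored
matrix = vec.transform(["great great plot"]).toarray()
print(matrix)
```

Note that the default tokenizer drops single-character tokens such as "a", so they never enter the vocabulary.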

The **X_train_vec** and **X_test_vec** information is now 'consumable' by a model.  We can use X_train_vec and the target (y_train) to create our model.  We can then make predictions with X_test_vec and compare them to y_test.  Try different models, different model parameters and different vectorizer parameters to see what performance you achieve!

In [None]:
from ___________ import ____________
clf = ______________(tweaked_parameter='')

In [None]:
clf.fit(_____, ______)
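
If it helps to see the pieces working together first, the sketch below chains a vectorizer and a classifier on four invented reviews. `MultinomialNB` is only one possible model choice, and the texts, labels and variable names are all hypothetical; the IMDB data will need more careful tuning.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical mini-corpus: 1 = good review, 0 = bad review
train_texts = [
    "loved it, great film",
    "wonderful acting and great story",
    "terrible plot, awful acting",
    "boring and awful film",
]
train_labels = [1, 1, 0, 0]

# The pipeline vectorizes the text and then fits the classifier on the counts
text_model = make_pipeline(CountVectorizer(), MultinomialNB())
text_model.fit(train_texts, train_labels)

# Predict the polarity of two unseen reviews
predictions = text_model.predict(["great story", "awful plot"])
print(predictions)
```

A pipeline like this also makes it easy to swap in a different vectorizer setting or model and compare the results.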