# Chapter 1. The Machine Learning Landscape

## What is Machine Learning?

Machine Learning is the science of programming computers so they can learn from data.

A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E. -- Tom Mitchell, 1997.

Example: spam filter
Task T: to flag spam for new emails
Experience E: existing emails with labels (either spam or non-spam)
Performance measure P: fraction of emails classified correctly (accuracy)

Non-example: downloading a copy of all Wikipedia pages. The computer then holds more data, but it does not get better at any task.

## Why Use Machine Learning?

Traditional approach for spam filter:
1. Choose features of spam emails manually: "4U", "credit card", "free", "amazing"
2. Write a program to detect exactly the features you chose
3. Test the program and modify the features until satisfactory

Drawbacks: 
1. A large number of features is needed, which makes the program hard to maintain
2. Spammers may change their writing to avoid explicit rules: change "4U" to "For U".
3. For some complex problems, manually engineered features are not good enough, e.g. recognizing handwritten digits
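
The traditional approach and its second drawback can be sketched in a few lines. This is only an illustration: the function name `is_spam_by_rules` is hypothetical, and the phrase list simply reuses the features from step 1.

```python
# Hand-picked spam phrases, reusing the features listed in step 1
SPAM_PHRASES = ["4u", "credit card", "free", "amazing"]

def is_spam_by_rules(email_text):
    """Flag an email as spam if it contains any hand-picked phrase."""
    text = email_text.lower()
    return any(phrase in text for phrase in SPAM_PHRASES)

print(is_spam_by_rules("Amazing deal 4U"))     # matches "amazing" and "4u"
print(is_spam_by_rules("A great deal For U"))  # no rule matches the reworded text
```

Catching the reworded email would require adding yet another rule by hand, which is how such rule lists grow long and hard to maintain.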

Machine Learning models:
1. The model automatically learns which words and phrases are good predictors of spam.
2. Since the program is not a stack of explicit rules, it is much shorter, easier to maintain, and most likely more accurate.
3. With new training data, the Machine Learning model can update automatically to capture new indicators of spam emails.
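
A minimal sketch of the learned alternative, assuming scikit-learn is installed. The four training emails are made up for illustration, and the choice of a bag-of-words model with naive Bayes is an assumption of this sketch, not something the notes prescribe.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up training set (hypothetical data, for illustration only)
emails = [
    "free credit card offer 4U",
    "amazing free prize, claim now",
    "meeting moved to 3pm tomorrow",
    "please review the attached report",
]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = non-spam

# Word counts feed a naive Bayes classifier; no spam phrases are hard-coded
spam_model = make_pipeline(CountVectorizer(), MultinomialNB())
spam_model.fit(emails, labels)
print(spam_model.predict(["free prize offer", "report for the meeting"]))

# New labelled spam arrives: refit on the enlarged set, no rules to rewrite
emails.append("great deal for u, claim now")
labels.append(1)
spam_model.fit(emails, labels)
print(spam_model.predict(["deal for u"]))
```

With so little data the model is only a toy, but it shows the workflow: the indicators of spam come from the labelled examples, and adapting to new spam means refitting rather than editing rules.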

What Machine Learning is great for:
1. Problems for which existing solutions require a lot of hand-tuning or long lists of rules: one Machine Learning algorithm can often simplify code and perform better.
2. Complex problems for which there is no good solution at all using a traditional approach: the best Machine Learning techniques can find a solution.
3. Fluctuating environments: a Machine Learning system can adapt to new data.
4. Large amounts of data: with Machine Learning, computers can process large datasets much faster than humans can.


## Various Types of Machine Learning
- Supervised vs. Unsupervised Learning
- Semisupervised, reinforcement, transfer, adversarial learning...
- Online learning
- Instance based vs. model based learning

## Challenges of Machine Learning
- Insufficient quantity of training data
- Non-representative training data
- Irrelevant features

## Two Machine Learning Guidelines
- No Free Lunch Theorem
- Curse of Dimensionality
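
One facet of the curse of dimensionality is that random points drift farther apart as the number of features grows, so a fixed amount of training data covers the space more and more sparsely. The sketch below illustrates this with NumPy; the helper name `mean_pair_distance` and the sample sizes are arbitrary choices for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_pair_distance(n_dims, n_pairs=1000):
    """Average Euclidean distance between random point pairs in the unit hypercube."""
    a = rng.random((n_pairs, n_dims))
    b = rng.random((n_pairs, n_dims))
    return np.linalg.norm(a - b, axis=1).mean()

# The average distance keeps growing as dimensions are added
for n_dims in (2, 10, 100, 1000):
    print(n_dims, round(mean_pair_distance(n_dims), 2))
```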

## A Machine Learning Example: Does money make people happier?

### Step 1: Look for data
Suppose you want to know whether money makes people happy. The first step is to find data that characterize how wealthy and how happy people are. Finding informative data is generally not an easy task, but in this example we will simply use the Better Life Index data from the OECD's website:

http://stats.oecd.org/index.aspx?DataSetCode=BLI

as well as stats about GDP per capita from the IMF's website:

https://www.imf.org/external/datamapper/NGDPDPC@WEO/OEMDC/ADVEC/WEOWORLD

Download these data as BLI.csv and GDP.csv (for the latter, you need to use Excel to save the original file as a .csv file).

### Step 2: Load the data
- Use read_csv() from the pandas library to load the csv files as pandas DataFrames.
- Get to know the data: check the column names, shape of dataset, data types. Use value_counts() to show frequencies for categorical data. Use hist() to draw histograms for numerical data.
- Extract the average Life Satisfaction value for each country.

In [None]:
# Find the directory of the Data folder
import os
cur_path = os.getcwd()
os.listdir(cur_path)

In [None]:
datapath = cur_path + '/Data/'
# OR
datapath = os.path.join(cur_path, 'Data/')
os.listdir(datapath)

In [None]:
import pandas as pd
pd.__version__

In [None]:
# Use a data frame to store values from BLI.csv
bli = pd.read_csv(datapath + 'BLI.csv')

In [None]:
# Show the first few rows of bli
bli.head(10)

In [None]:
# What are the columns?
bli.columns

In [None]:
# What are the data types?
bli.dtypes

In [None]:
# We can manually change the data types using astype()
# bli['PowerCode Code'] = bli['PowerCode Code'].astype('int64')

In [None]:
# Look at the values of the 'Indicator' feature
bli['Indicator'].value_counts()

In [None]:
# Extract Life Satisfaction part from the dataset
new_bli = bli[bli['Indicator'] == 'Life satisfaction']
new_bli['Indicator'].value_counts()

In [None]:
new_bli['INEQUALITY'].value_counts()

In [None]:
# from new_bli, extract the values associated with INEQUALITY being TOT.
new_bli2 = new_bli[new_bli['INEQUALITY'] == 'TOT']
new_bli2['INEQUALITY'].value_counts()

In [None]:
new_bli2.head()

In [None]:
# Create a data frame containing only 'Country' and 'Value'
final_bli = new_bli2[['Country', 'Value']]
final_bli.head()

In [None]:
# Set Country to be the index of the data frame
final_bli = final_bli.set_index('Country')
final_bli.head()

In [None]:
# Rename the Value column as BLI (Later we will create another column called GDP)
final_bli = final_bli.rename(columns={'Value':'BLI'})
final_bli.head()
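
The filtering steps above can also be chained into one expression. The sketch below is self-contained: it reads a small in-memory CSV whose rows are made up (fictional countries, not real OECD values) but whose column names mimic the ones used in BLI.csv above.

```python
import io
import pandas as pd

# Made-up rows that mimic the BLI.csv columns used above (not real OECD values)
toy_csv = io.StringIO(
    "Country,Indicator,INEQUALITY,Value\n"
    "Freedonia,Life satisfaction,TOT,7.0\n"
    "Freedonia,Life satisfaction,MN,6.9\n"
    "Freedonia,Employment rate,TOT,70.0\n"
    "Sylvania,Life satisfaction,TOT,5.5\n"
)
toy = pd.read_csv(toy_csv)

# Keep only total life satisfaction, then index by country and rename the value column
mask = (toy["Indicator"] == "Life satisfaction") & (toy["INEQUALITY"] == "TOT")
toy_bli = (
    toy.loc[mask, ["Country", "Value"]]
       .set_index("Country")
       .rename(columns={"Value": "BLI"})
)
print(toy_bli)
```

Combining both conditions in a single boolean mask avoids the intermediate data frames, at the cost of not being able to inspect each step separately.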

In [None]:
# Use pandas to draw a histogram for the BLI
%matplotlib inline
final_bli['BLI'].hist()

import matplotlib.pyplot as plt
plt.hist(final_bli['BLI'])

### Interlude: An introduction to pandas DataFrames
- How to create a dataframe
- Computing descriptive statistics
- Column / row slicing
- How to add new columns

In [None]:
# Create a numpy array with data
import numpy as np
data = np.array([['Alice', 24, 'A'],
                 ['Bob', 25, 'B'],
                 ['Clare', 24, 'C'],
                 ['Doug', 26, 'C']
                ])
print(data)

In [None]:
# load data as a pandas DataFrame
df = pd.DataFrame(data=data,
                      columns=['Name', 'Age', 'Grade'],
                      index=[10001, 10002, 10003, 10004])

In [None]:
df.head()

In [None]:
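df2 = pd.DataFrame(df, copy=True)  # df2 is an independent copy of df
df.loc[10001, 'Name'] = 'abc'      # modify df with a single .loc call (avoids chained indexing)
df2.loc[10001, 'Name']             # the copy is unaffected by the change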
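
The point of the cell above is the difference between a copy and a second name for the same object. A minimal sketch, using a small made-up frame and hypothetical variable names:

```python
import pandas as pd

original = pd.DataFrame({"Name": ["Alice", "Bob"]}, index=[10001, 10002])
alias = original               # same object, just a second name
independent = original.copy()  # separate data

original.loc[10001, "Name"] = "abc"

print(alias.loc[10001, "Name"])        # follows the change
print(independent.loc[10001, "Name"])  # keeps the old value
```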

In [None]:
# Things you can quickly get from a data frame
print('columns:', df.columns)
print('data types:', df.dtypes)
df['Age'] = df['Age'].astype('int64')
print('frequencies of feature Name:', df['Name'].value_counts())
print('average age:', df['Age'].mean())
print('variance of ages:', df['Age'].var())
df['Grade'].value_counts().plot(kind='bar')  # Grade is categorical: plot counts rather than a histogram
print('shape of data frame:', df.shape)
df['Age'].describe()

In [None]:
df

In [None]:
df[['Name', 'Age']]

In [None]:
# Extract the two rows in the middle
df.iloc[1:3, 0:2]

In [None]:
df.loc[10001:10003, ['Age', 'Grade']]
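
The two slicing cells above differ in a way that is easy to miss: `iloc` slices by position and excludes the end, while `loc` slices by label and includes it. A small sketch on a made-up frame:

```python
import pandas as pd

people = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Clare", "Doug"], "Age": [24, 25, 24, 26]},
    index=[10001, 10002, 10003, 10004],
)

by_position = people.iloc[1:3]      # positions 1 and 2; end position 3 is excluded
by_label = people.loc[10002:10003]  # labels 10002 through 10003; end label is included

print(by_position.equals(by_label))
print(len(people.iloc[0:2]), len(people.loc[10001:10003]))
```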

In [None]:
df

In [None]:
# Create a new column whose value is the difference between age and the
# average age
df['Age Difference'] = df['Age'] - df['Age'].mean()
df

In [None]:
# Create a new column called 'Passed' to detect whether the grade is B or above
def is_passed(x):
    return x == 'A' or x == 'B'

df['Passed'] = df['Grade'].apply(is_passed)
df['Passed2'] = df['Grade'].apply(lambda x: (x == 'A' or x == 'B'))
df

In [None]:
df['Failed'] = df['Passed'].apply(lambda x : not x)
df['Failed2'] = df['Grade'].apply(lambda x: not (x == 'A' or x == 'B'))
df
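
As an alternative to `apply`, the same columns can be built with vectorized operations. This sketch uses `Series.isin` and the `~` operator on a made-up frame; it is one possible formulation, not a replacement for the cells above.

```python
import pandas as pd

grades = pd.DataFrame({"Grade": ["A", "B", "C", "C"]})

grades["Passed"] = grades["Grade"].isin(["A", "B"])  # True where the grade is A or B
grades["Failed"] = ~grades["Passed"]                 # element-wise logical NOT
print(grades)
```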

In [None]:
# NEXT: load the 2018 GDP per capita data from GDP.csv as a DataFrame gdp.
# Do some data exploration.

In [None]:
os.getcwd()

In [None]:
print(datapath)
os.listdir(datapath)

In [None]:
gdp = pd.read_csv(datapath + 'GDP.csv', sep=',')

In [None]:
gdp.head()

In [None]:
gdp.columns

In [None]:
# Change the first column's label to 'Country'
gdp.rename(columns={gdp.columns[0]:'Country'}, inplace=True)
gdp.head()

In [None]:
# Keep only the country column and the 2018 column
gdp = gdp[['Country', '2018']]
gdp.head()

In [None]:
gdp.set_index('Country', inplace=True)
gdp.head()

In [None]:
gdp.rename(columns={'2018':'GDP'}, inplace=True)
gdp.head()

In [None]:
final_bli.head()

In [None]:
data = pd.merge(gdp, final_bli, left_index=True, right_index=True)
data.head()
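
By default `pd.merge` performs an inner join, so a country that appears in only one of the two tables is silently dropped. The sketch below, on made-up values for fictional countries, shows one way to list the dropped rows using an outer join with `indicator=True`.

```python
import pandas as pd

# Made-up values for illustration only
toy_gdp = pd.DataFrame(
    {"GDP": [50000.0, 9000.0, 30000.0]},
    index=pd.Index(["Freedonia", "Sylvania", "Osterlich"], name="Country"),
)
toy_bli = pd.DataFrame(
    {"BLI": [7.0, 5.5]},
    index=pd.Index(["Freedonia", "Sylvania"], name="Country"),
)

# Default how="inner": only countries present in both tables survive
inner = pd.merge(toy_gdp, toy_bli, left_index=True, right_index=True)

# Outer join with an indicator column reveals where each row came from
outer = pd.merge(toy_gdp, toy_bli, left_index=True, right_index=True,
                 how="outer", indicator=True)
dropped = outer[outer["_merge"] != "both"]

print(inner)
print(dropped)
```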

## Visualize the correlation

In [None]:
data['GDP'] = data['GDP'].astype(np.float64)
data.dtypes

In [None]:
data.sort_values(by='GDP', inplace=True)
data.head(100)

In [None]:
# Plot GDP vs. BLI
plt.plot(data['GDP'], data['BLI'],'.')
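
Besides the scatter plot, the strength of a linear relationship can be summarized by the Pearson correlation coefficient. A minimal sketch with `Series.corr` on made-up numbers (not the real GDP or BLI values):

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({"GDP": [10000.0, 20000.0, 30000.0, 40000.0],
                    "BLI": [5.0, 6.0, 6.5, 7.5]})

r = toy["GDP"].corr(toy["BLI"])  # Pearson correlation by default
print(round(r, 3))
```

A value close to 1 indicates a strong positive linear relationship; it says nothing about whether one quantity causes the other.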

## Use linear regression to characterize the trend


In [None]:
# Use LinearRegression from scikit-learn
from sklearn.linear_model import LinearRegression
model = LinearRegression()  # Create a linear regression model
model.fit(data[['GDP']], data['BLI'])  # Fit model to data

In [None]:
# Check the coefficients of the line
m = model.coef_[0]
b = model.intercept_
print('m, b:', m, b)
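
For a single feature, the least-squares line has a closed form: the slope is the sum of products of deviations divided by the sum of squared deviations of x, and the intercept makes the line pass through the point of means. The sketch below checks this against `LinearRegression` on made-up numbers; the variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up values for illustration only
x = np.array([10000.0, 20000.0, 30000.0, 40000.0])
y = np.array([5.0, 6.0, 6.5, 7.5])

# Closed-form simple linear regression
m_manual = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b_manual = y.mean() - m_manual * x.mean()

# The fitted model should agree up to floating-point error
check = LinearRegression().fit(x.reshape(-1, 1), y)
print(m_manual, check.coef_[0])
print(b_manual, check.intercept_)
```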

In [None]:
# Plot the regression line together with data
plt.plot(data["GDP"], data['BLI'], 'b.')

# draw the line
xs = np.arange(0, 120000, 1000)
ys = m * xs + b
plt.plot(xs, ys, 'g-')

plt.plot([100000], [10.0], 'r.')

In [None]:
# Split the data: hold out two countries for testing and train on the rest
countries = ['Turkey', 'Germany']
data_train = data.drop(index=countries)
data_train.head(100)

In [None]:
# Train the linear regression model on data_train
model2 = LinearRegression()
model2.fit(data_train[['GDP']], data_train['BLI'])

In [None]:
data.loc['Turkey', 'BLI']

In [None]:
# predict() expects a 2D input (rows = samples, columns = features)
model2.predict(data.loc[['Turkey'], ['GDP']])

In [None]:
data.loc['Germany', 'BLI']

In [None]:
# predict() expects a 2D input (rows = samples, columns = features)
model2.predict(data.loc[['Germany'], ['GDP']])
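
The hold-out comparison above can be condensed into a single error table. The sketch below uses made-up data with placeholder row labels, and the names `held_out` and `toy_model` are hypothetical.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up values for illustration only
toy = pd.DataFrame(
    {"GDP": [10000.0, 20000.0, 30000.0, 40000.0, 50000.0],
     "BLI": [5.0, 6.0, 6.5, 7.5, 7.0]},
    index=["A", "B", "C", "D", "E"],
)

held_out = ["B", "E"]
train = toy.drop(index=held_out)  # rows used for fitting
test = toy.loc[held_out]          # rows kept aside for evaluation

toy_model = LinearRegression().fit(train[["GDP"]], train["BLI"])
predicted = toy_model.predict(test[["GDP"]])  # 2D input: one row per held-out country

# Absolute prediction error for each held-out row
errors = pd.Series(predicted - test["BLI"].to_numpy(), index=held_out)
print(errors.abs())
```

With only two held-out rows the errors are a rough indication at best; a larger test set gives a more reliable estimate of how the model generalizes.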

## Week 2 Homework
Please answer the first 8 questions after Chapter 1.