# Getting Datasets

## Using the Scikit-learn Dataset

In [None]:
from sklearn import datasets
iris = datasets.load_iris()   # raw data of type Bunch


**TIP**: The Iris flower dataset or Fisher’s Iris dataset is a multivariate dataset introduced by the British statistician and biologist Ronald Fisher. The dataset consists of
50 samples from each of three species of Iris (Iris setosa, Iris virginica, and Iris versicolor). Four features were measured from each sample: the length and the width
of the sepals and petals in centimeters. Based on the combination of these four
features, Fisher developed a linear discriminant model to distinguish the species
from each other.

In [None]:
print(iris.DESCR)


In [None]:
print(iris.data)               # Features


In [None]:
print(iris.feature_names)      # Feature Names


In [None]:
print(iris.target)             # Labels
print(iris.target_names)       # Label names


In [None]:
import pandas as pd
df = pd.DataFrame(iris.data)   # convert features
                               # to dataframe in Pandas
print(df.head())
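
The dataframe above labels its columns 0 to 3. A minimal sketch of a more readable version follows, assuming you want the feature names as column headers plus a column holding the species name; the `species` column name is an illustrative choice and is not defined by the dataset.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()

# use the feature names as column headers instead of 0..3
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# 'species' is an illustrative column name: map each numeric label
# (0, 1, 2) to its name in iris.target_names
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)

print(df.head())
print(df['species'].value_counts())
```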


In [None]:
# data on breast cancer
breast_cancer = datasets.load_breast_cancer()

# data on diabetes
diabetes = datasets.load_diabetes()

# dataset of 1797 8x8 images of hand-written digits
digits = datasets.load_digits()
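
Each of these loaders returns the same kind of `Bunch` object as `load_iris()`. The following sketch simply prints the shape of the features and labels in each dataset so that you can see what you are working with before training anything.

```python
from sklearn import datasets

# each loader returns a Bunch with .data (features) and .target (labels)
for name, loader in [('breast_cancer', datasets.load_breast_cancer),
                     ('diabetes', datasets.load_diabetes),
                     ('digits', datasets.load_digits)]:
    bunch = loader()
    print(name, bunch.data.shape, bunch.target.shape)

# the digits dataset also keeps the unflattened 8x8 images
digits = datasets.load_digits()
print(digits.images.shape)
```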


For more information on the datasets bundled with Scikit-learn, check out the documentation at http://scikit-learn.org/stable/datasets/index.html.

## Kaggle Dataset

Kaggle is the world’s largest community of data scientists and machine learners.
What started off as a platform for machine learning competitions now also
offers a public data platform, as well as a cloud-based workbench for
data scientists. Google acquired Kaggle in March 2017.

If you are learning machine learning, you can make use of the sample datasets
provided by Kaggle at https://www.kaggle.com/datasets/. Some of the interesting datasets include:

- Women’s Shoe Prices: A list of 10,000 women’s shoes and the prices at which they are sold (https://www.kaggle.com/datafiniti/womensshoes-prices)

- Fall Detection Data from China: Activity of elderly patients along with their medical information (https://www.kaggle.com/pitasr/falldata)

- NYC Property Sales: A year’s worth of properties sold on the NYC real
estate market (https://www.kaggle.com/new-york-city/nyc-propertysales#nyc-rolling-sales.csv)

- US Flight Delay: Flight Delays for year 2016 (https://www.kaggle.com/niranjan0272/us-flight-delay)


# Generating Your Own Dataset

## Linearly Distributed Dataset

The `make_regression()` function generates data that is linearly distributed.
You can specify the number of features that you want, as well as the standard
deviation of the Gaussian noise applied to the output:

In [None]:
%matplotlib inline
from matplotlib import pyplot as plt
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=100, n_features=1, noise=5.4)
plt.scatter(X,y)
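
Because the data is generated from a known straight line plus noise, you can check how closely a fitted line recovers it. The sketch below assumes the `coef=True` and `random_state` parameters of `make_regression()`; the choice of `np.polyfit()` for the fit is illustrative only.

```python
import numpy as np
from sklearn.datasets import make_regression

# coef=True also returns the slope used to generate the data;
# random_state fixes the random numbers so reruns give the same points
X, y, true_coef = make_regression(n_samples=100, n_features=1, noise=5.4,
                                  coef=True, random_state=0)

# least-squares slope and intercept estimated from the noisy points
slope, intercept = np.polyfit(X[:, 0], y, 1)

print('slope used to generate the data:', float(true_coef))
print('slope estimated from the data:  ', slope)
```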


## Clustered Dataset

The `make_blobs()` function generates a specified number of clusters of random data. This
is very useful when performing clustering in unsupervised learning:

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_blobs

np.random.seed(10)

X, y = make_blobs(500, centers=3)  # Generate isotropic Gaussian
                                   # blobs for clustering

rgb = np.array(['r', 'g', 'b'])

# plot the blobs using a scatter plot and use color coding
plt.scatter(X[:, 0], X[:, 1], color=rgb[y])
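
As an alternative to seeding NumPy globally, `make_blobs()` accepts its own `random_state`. The sketch below also sets `cluster_std` (an illustrative value) and inspects the labels that come back, without plotting.

```python
import numpy as np
from sklearn.datasets import make_blobs

# random_state makes the result repeatable without touching NumPy's
# global seed; cluster_std sets how spread out each blob is
X, y = make_blobs(n_samples=500, centers=3, cluster_std=0.8,
                  random_state=10)

print(X.shape)           # 500 points with 2 features each
print(np.unique(y))      # cluster labels 0, 1, and 2
print(np.bincount(y))    # number of points in each cluster
```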


## Clustered Dataset Distributed in Circular Fashion

The `make_circles()` function generates a random two-dimensional dataset in which
a large circle encloses a smaller circle. This is useful when performing
classification with algorithms like SVM (Support Vector Machines):

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_circles

X, y = make_circles(n_samples=100, noise=0.09)

rgb = np.array(['r', 'g', 'b'])
plt.scatter(X[:, 0], X[:, 1], color=rgb[y])
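
To see how the two classes are laid out, it can help to generate the circles without noise and measure each point's distance from the origin. This sketch assumes the `factor` parameter (the inner circle's radius relative to the outer one) and leaves `noise` at its default of no noise.

```python
import numpy as np
from sklearn.datasets import make_circles

# with no noise, every point lies exactly on one of the two circles;
# factor is the radius of the inner circle relative to the outer one
X, y = make_circles(n_samples=100, factor=0.5, random_state=0)

# distance of each point from the origin
radius = np.sqrt(X[:, 0] ** 2 + X[:, 1] ** 2)

print(radius[y == 0][:3])   # points labeled 0 lie on the outer circle
print(radius[y == 1][:3])   # points labeled 1 lie on the inner circle
```

Since the classes differ only in their distance from the center, no single straight line can separate them.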


# Getting Started with Scikit-learn

The easiest way to get started with machine learning with Scikit-learn is to start
with linear regression. *Linear regression* is a linear approach for modeling the
relationship between a scalar dependent variable y and one or more explanatory variables (or independent variables). For example, imagine that you have
a set of data comprising the heights (in meters) of a group of people and their
corresponding weights (in kg):

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

# represents the heights of a group of people in metres
heights = [[1.6], [1.65], [1.7], [1.73], [1.8]]

# represents the weights of a group of people in kgs
weights = [[60], [65], [72.3], [75], [80]]

plt.title('Weights plotted against heights')
plt.xlabel('Heights in metres')
plt.ylabel('Weights in kilograms')

plt.plot(heights, weights, 'k.')

# axis range for x and y
plt.axis([1.5, 1.85, 50, 90])
plt.grid(True)


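**TIP**: Observe that the *heights* and *weights* are both represented as
two-dimensional lists. The `fit()` function requires the X argument to be
two-dimensional (of type `list` or `ndarray`), with one row per sample. The y
argument may be one-dimensional, but it is kept two-dimensional here so that the
predictions, the coefficients, and the intercept come back in the shape that the
later code snippets index into.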
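
If your measurements start out as flat lists, one way to get them into the two-dimensional shape used above is NumPy's `reshape()`. The following sketch assumes the same five values; the names `heights_flat` and `weights_flat` are illustrative.

```python
import numpy as np

# measurements stored as flat, one-dimensional lists
heights_flat = [1.6, 1.65, 1.7, 1.73, 1.8]
weights_flat = [60, 65, 72.3, 75, 80]

# reshape(-1, 1) turns n values into n rows of one column each
heights = np.array(heights_flat).reshape(-1, 1)
weights = np.array(weights_flat).reshape(-1, 1)

print(heights.shape)
print(heights.tolist())
```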

## Using the LinearRegression Class for Fitting the Model

So how do we draw the straight line that best fits the points? It turns
out that the Scikit-learn library has the `LinearRegression` class that helps you
to do just that. All you need to do is to create an instance of this class and use the *heights* and *weights* lists to create a linear regression model using the `fit()`
function, like this:

In [None]:
from sklearn.linear_model import LinearRegression

# Create and fit the model
model = LinearRegression()
model.fit(X=heights, y=weights)


## Making Predictions

In [None]:
# make prediction
weight = model.predict([[1.75]])[0][0]
print(round(weight,2))         # 76.04


**TIP**: In Scikit-learn, you typically use the `fit()` function to train a model. Once the
model is trained, you use the `predict()` function to make a prediction.
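
The `predict()` function is not limited to one value at a time. As a small sketch, the code below refits the same model and predicts the weights for several heights in one call; `new_heights` is an illustrative name and its values are arbitrary.

```python
from sklearn.linear_model import LinearRegression

heights = [[1.6], [1.65], [1.7], [1.73], [1.8]]
weights = [[60], [65], [72.3], [75], [80]]

model = LinearRegression()
model.fit(X=heights, y=weights)

# predict() accepts any number of rows: one prediction per height
new_heights = [[1.55], [1.75], [1.9]]
predictions = model.predict(new_heights)

for h, w in zip(new_heights, predictions):
    print('%.2f m -> %.2f kg' % (h[0], w[0]))
```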

## Plotting the Linear Regression Line

It would be useful to visualize the linear regression line that has been created
by the *LinearRegression* class. Let’s do this by first plotting the original data
points and then sending the *heights* list to the model to predict the weights.
We then plot the series of forecasted weights to obtain the line. The following
code snippet shows how this is done:

In [None]:
import matplotlib.pyplot as plt

heights = [[1.6], [1.65], [1.7], [1.73], [1.8]]
weights = [[60], [65], [72.3], [75], [80]]

plt.title('Weights plotted against heights')
plt.xlabel('Heights in metres')
plt.ylabel('Weights in kilograms')

plt.plot(heights, weights, 'k.')

plt.axis([1.5, 1.85, 50, 90])
plt.grid(True)

# plot the regression line
plt.plot(heights, model.predict(heights), color='r')


## Getting the Gradient and Intercept of the Linear Regression Line

In the previous plot, it is not clear at what value the linear regression line intercepts
the y-axis. This is because we have adjusted the x-axis to start plotting at 1.5. A
better way to visualize this would be to set the x-axis to start from 0 and enlarge
the range of the y-axis. You then plot the line by feeding in two extreme values
of the height: 0 and 1.8. The following code snippet re-plots the points and the
linear regression line:

In [None]:
plt.title('Weights plotted against heights')
plt.xlabel('Heights in metres')
plt.ylabel('Weights in kilograms')

plt.plot(heights, weights, 'k.')

plt.axis([0, 1.85, -200, 200])
plt.grid(True)

# plot the regression line
extreme_heights = [[0], [1.8]]
plt.plot(extreme_heights, model.predict(extreme_heights), color='b')


You can get the y-intercept by predicting the weight when the height is 0:

In [None]:
round(model.predict([[0]])[0][0],2)   # -104.75


However, the model object provides the answer directly through the `intercept_` property:

In [None]:
print(round(model.intercept_[0],2))   # -104.75


Using the model object, you can also get the gradient of the linear regression
line through the coef_ property:

In [None]:
print(round(model.coef_[0][0],2))     # 103.31
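
To see where these two numbers come from, the following sketch recomputes them with plain NumPy using the standard least-squares formulas for a single feature. It is meant as a cross-check of the values above, not as a description of how Scikit-learn computes them internally.

```python
import numpy as np

heights = np.array([1.6, 1.65, 1.7, 1.73, 1.8])
weights = np.array([60, 65, 72.3, 75, 80])

# ordinary least squares for one feature:
# gradient = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x) ** 2)
x_dev = heights - heights.mean()
y_dev = weights - weights.mean()
gradient = np.sum(x_dev * y_dev) / np.sum(x_dev ** 2)

# the fitted line always passes through the point (mean_x, mean_y)
intercept = weights.mean() - gradient * heights.mean()

print(round(gradient, 2))                      # 103.31
print(round(intercept, 2))                     # -104.75
print(round(gradient * 1.75 + intercept, 2))   # 76.04
```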


## Examining the Performance of the Model by Calculating the Residual Sum of Squares

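To find out how well your linear regression line fits the data points, you can
calculate the *Residual Sum of Squares* (RSS), which is the sum of the squared
differences between each actual value and the value that the model predicts for it.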

In [None]:
import numpy as np

print('Residual sum of squares: %.2f' %
       np.sum((weights - model.predict(heights)) ** 2))


The RSS should be as small as possible, with 0 indicating that the regression line fits the points exactly (rarely achievable in the real world).

## Evaluating the Model Using a Test Dataset

In [None]:
# test data
heights_test = [[1.58], [1.62], [1.69], [1.76], [1.82]]
weights_test = [[58], [63], [72], [73], [85]]


In [None]:
# Total Sum of Squares (TSS)
weights_test_mean = np.mean(np.ravel(weights_test))
TSS = np.sum((np.ravel(weights_test) -
              weights_test_mean) ** 2)
print("TSS: %.2f" % TSS)

# Residual Sum of Squares (RSS)
RSS = np.sum((np.ravel(weights_test) -
              np.ravel(model.predict(heights_test)))
                 ** 2)
print("RSS: %.2f" % RSS)

# R_squared
R_squared = 1 - (RSS / TSS)
print("R-squared: %.2f" % R_squared)


**TIP**: The ravel() function converts the two-dimensional list into a contiguous
flattened (one-dimensional) array.
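
In symbols, the three quantities calculated in the preceding code snippet are as follows, where $y_i$ is an actual test weight, $\bar{y}$ is the mean of the test weights, and $\hat{y}_i$ is the weight that the model predicts for the corresponding test height:

$$TSS = \sum_{i}(y_i - \bar{y})^2 \qquad RSS = \sum_{i}(y_i - \hat{y}_i)^2 \qquad R^2 = 1 - \frac{RSS}{TSS}$$

For the test data above, TSS is 430.80 and RSS is approximately 24.62, which gives an R-Squared of approximately 0.9429.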

Fortunately, you don’t have to calculate the R-Squared manually.
Scikit-learn has the `score()` function to calculate the R-Squared automatically
for you:

In [None]:
# using scikit-learn to calculate r-squared
print('R-squared: %.4f' % model.score(heights_test,
                                      weights_test))

# R-squared: 0.9429


An R-Squared value of 0.9429 means that the model accounts for about 94.29% of the variation in the test weights, which indicates a pretty good fit for your test data.
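
As a cross-check, Scikit-learn also offers a standalone `r2_score()` function in its `metrics` module. The sketch below assumes the same training and test data and compares the two approaches.

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

heights = [[1.6], [1.65], [1.7], [1.73], [1.8]]
weights = [[60], [65], [72.3], [75], [80]]
heights_test = [[1.58], [1.62], [1.69], [1.76], [1.82]]
weights_test = [[58], [63], [72], [73], [85]]

model = LinearRegression().fit(heights, weights)

# r2_score compares the true values with the model's predictions
r2 = r2_score(weights_test, model.predict(heights_test))
print('R-squared: %.4f' % r2)
print('R-squared: %.4f' % model.score(heights_test, weights_test))
```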

## Persisting the Model

Once you have trained a model, it is often useful to be able to save it for later
use. Rather than retraining the model every time you have new data to test, a
saved model allows you to load the trained model and make predictions immediately without the need to train the model again.

There are two ways to save your trained model in Python:
- Using the standard `pickle` module in Python to serialize and deserialize objects
- Using the `joblib` library, which is installed together with Scikit-learn and is optimized to save and load Python objects that deal with NumPy data

The first example you will see is saving the model using the pickle module:

In [None]:
import pickle

# save the model to disk
filename = 'HeightsAndWeights_model.sav'
# open the file in write and binary mode; the with block closes it afterward
with open(filename, 'wb') as f:
    pickle.dump(model, f)


In the preceding code snippet, you first open a file in "wb" mode ("w" for
write and "b" for binary). You then use the `dump()` function from the pickle
module to save the model into the file.

To load the model from file, use the load() function:

In [None]:
# load the model from disk (only unpickle files from sources you trust)
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)


In [None]:
result = loaded_model.score(heights_test,
                            weights_test)
print(result)

Using the joblib module is very similar to using the pickle module:

In [None]:
import joblib   # sklearn.externals.joblib is no longer available in recent Scikit-learn releases

# save the model to disk
filename = 'HeightsAndWeights_model2.sav'
joblib.dump(model, filename)

# load the model from disk
loaded_model = joblib.load(filename)
result = loaded_model.score(heights_test,
                            weights_test)
print(result)
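
The following sketch puts both approaches side by side and confirms that each restored model predicts the same value as the original. It assumes a temporary directory so that no files are left behind; the file names are illustrative.

```python
import os
import pickle
import tempfile

import joblib
from sklearn.linear_model import LinearRegression

heights = [[1.6], [1.65], [1.7], [1.73], [1.8]]
weights = [[60], [65], [72.3], [75], [80]]
model = LinearRegression().fit(heights, weights)

# a temporary directory keeps the example from leaving files behind
with tempfile.TemporaryDirectory() as folder:
    pickle_path = os.path.join(folder, 'model_pickle.sav')
    joblib_path = os.path.join(folder, 'model_joblib.sav')

    with open(pickle_path, 'wb') as f:
        pickle.dump(model, f)
    joblib.dump(model, joblib_path)

    with open(pickle_path, 'rb') as f:
        from_pickle = pickle.load(f)
    from_joblib = joblib.load(joblib_path)

# both restored models predict the same weight as the original
original = model.predict([[1.75]])[0][0]
print(original == from_pickle.predict([[1.75]])[0][0])
print(original == from_joblib.predict([[1.75]])[0][0])
```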


# Data Cleansing

## Cleaning Rows with NaNs

In [None]:
import pandas as pd
df = pd.read_csv('https://drive.google.com/uc?id=1YXCQx-K7o4ITIRl0TX9L_4VtQGtPnxSs')
df.isnull().sum()


In [None]:
print(df)

### Replacing NaN with the Mean of the Column

In [None]:
# replace all the NaNs in column B with the average of column B
df.B = df.B.fillna(df.B.mean())
print(df)
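
If several columns contain NaNs, you can fill all of them in one step. The sketch below uses a small made-up dataframe as a stand-in for the CSV file (its values are hypothetical) and passes the per-column means to `fillna()`.

```python
import numpy as np
import pandas as pd

# hypothetical stand-in for the CSV file used above
df = pd.DataFrame({'A': [1.0, 4.0, np.nan, 10.0],
                   'B': [2.0, np.nan, 8.0, np.nan],
                   'C': [3.0, 6.0, 9.0, 12.0]})

# df.mean() holds one mean per column (NaNs are skipped);
# fillna() uses each column's own mean for that column's NaNs
df_filled = df.fillna(df.mean())
print(df_filled)
```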


### Removing Rows

In [None]:
df = pd.read_csv('https://drive.google.com/uc?id=1YXCQx-K7o4ITIRl0TX9L_4VtQGtPnxSs')
df = df.dropna()                             # drop all rows with NaN
print(df)


Observe that after removing the rows containing NaN, the index is no longer in
sequential order. If you need to reset the index, use the reset_index() function:

In [None]:
df = df.reset_index(drop=True)               # reset the index
print(df)
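
Dropping every row that contains any NaN can discard more data than you intend. As a sketch, the `how` and `subset` parameters of `dropna()` give finer control; the dataframe here is a made-up stand-in for the CSV file.

```python
import numpy as np
import pandas as pd

# hypothetical stand-in for the CSV file used above
df = pd.DataFrame({'A': [1.0, np.nan, 7.0, np.nan],
                   'B': [2.0, 5.0, np.nan, np.nan],
                   'C': [3.0, 6.0, 9.0, np.nan]})

# drop a row only if every value in it is NaN
print(df.dropna(how='all'))

# drop a row only if column B is NaN, ignoring NaNs elsewhere
print(df.dropna(subset=['B']))

# drop rows and renumber the index in one chained statement
print(df.dropna().reset_index(drop=True))
```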


## Removing Duplicate Rows

In [None]:
import pandas as pd
df = pd.read_csv('https://drive.google.com/uc?id=16xLZ4HBsf6WK5TJg5MuLx30lUMETln9W')
print(df)
print("\n")
print(df.duplicated(keep=False))


The keep argument allows you to specify how to indicate
duplicates:
- The default is 'first': All duplicates are marked as True except for the
first occurrence
- 'last': All duplicates are marked as True except for the last occurrence
- False: All duplicates are marked as True
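
The three options are easiest to compare on a tiny example. The dataframe below is made up for illustration; its rows at index 1 and index 3 are identical.

```python
import pandas as pd

# hypothetical data: the rows at index 1 and index 3 are identical
df = pd.DataFrame({'A': [1, 2, 3, 2],
                   'B': ['x', 'y', 'z', 'y']})

print(df.duplicated(keep='first').tolist())  # [False, False, False, True]
print(df.duplicated(keep='last').tolist())   # [False, True, False, False]
print(df.duplicated(keep=False).tolist())    # [False, True, False, True]
```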

In [None]:
print(df.duplicated(keep="first"))


In [None]:
print(df[df.duplicated(keep=False)])


In [None]:
df.drop_duplicates(keep='first', inplace=True)  # remove duplicates and keep the first
print(df)


**TIP**: By default, the `drop_duplicates()` function does not modify the original
dataframe; it returns a new dataframe with the duplicate rows removed. If you want to
modify the original dataframe, set the `inplace` parameter to True, as shown in the
preceding code snippet.

Sometimes, you only want to remove duplicates that are found in certain
columns in the dataset. For example, if you look at the dataset that we have been
using, observe that for row 3 and row 4, the values of columns A and C are identical.
You can remove duplicates in certain columns by specifying the `subset`
parameter:

In [None]:
df.drop_duplicates(subset=['A', 'C'], keep='last',
                           inplace=True)     # remove all duplicates in
                                             # columns A and C and keep
                                             # the last
print(df)


**TIP**: To remove all duplicates, set the keep parameter to False. To keep the last
occurrence of duplicate rows, set the keep parameter to 'last'.

## Normalizing Columns

Normalization is crucial for some algorithms to model the data correctly. For
example, one of the columns in your dataset may contain values from 0 to 1,
while another column has values ranging from 400,000 to 500,000. The huge
disparity in the scale of the numbers could introduce problems when you use
the two columns to train your model. Using normalization, you can preserve the
relative spacing of the values within each column while keeping every column to the
same limited range. In Scikit-learn, you can use the `MinMaxScaler` class to scale
each column to a particular range of values (0 to 1 by default).

In [None]:
import pandas as pd
from sklearn import preprocessing

df = pd.read_csv('https://drive.google.com/uc?id=1bSUZBtBV37KAv9f2LAw5AHV3QZzLYgQU')
print(df)

x = df.values.astype(float)

min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
df = pd.DataFrame(x_scaled, columns=df.columns)
print(df)
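
With its default range of 0 to 1, `MinMaxScaler` applies (x - min) / (max - min) to each column. The sketch below performs the same calculation by hand in Pandas and compares the result; the dataframe is made up for illustration.

```python
import pandas as pd
from sklearn import preprocessing

# hypothetical data with columns on very different scales
df = pd.DataFrame({'A': [0.1, 0.5, 0.9],
                   'B': [400000, 450000, 500000]})

# min-max scaling by hand: (x - min) / (max - min), column by column
manual = (df - df.min()) / (df.max() - df.min())

# the same scaling using MinMaxScaler
scaled = preprocessing.MinMaxScaler().fit_transform(df.values.astype(float))

print(manual)
print(scaled)
```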


## Removing Outliers

In statistics, an outlier is a point that is distant from other observed points.
For example, given the set of values 234, 267, 1, 200, 245, 300, 199, 250, 8999, and
245, it is quite obvious that 1 and 8999 are outliers. They distinctly stand out
from the rest of the values, and they “lie outside” most of the other values in the
dataset; hence the word outlier. Outliers occur mainly due to errors in recording
or experimental error, and in machine learning it is important to remove them
prior to training your model, as they may distort your model if you don’t.
There are a number of techniques to remove outliers, and in this chapter we
discuss two of them:
- Tukey Fences
- Z-Score

### Tukey Fences

Tukey Fences is based on Interquartile Range (IQR). IQR is the difference between
the first and third quartiles of a set of values. The first quartile, denoted Q1,
is the value in the dataset that holds 25% of the values below it. The third quartile, denoted Q3, is the value in the dataset that holds 25% of the values above it. Hence, by definition, IQR = Q3 – Q1.

In Tukey Fences, outliers are values that are as follows:
- Less than Q1 – (1.5 × IQR), or
- More than Q3 + (1.5 × IQR)

In [None]:
import numpy as np

def outliers_iqr(data):
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    lower_bound = q1 - (iqr * 1.5)
    upper_bound = q3 + (iqr * 1.5)
    return np.where((data > upper_bound) | (data < lower_bound))


**TIP**: The np.where() function returns the location of items satisfying the conditions.
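
As a quick check, the sketch below applies the same function to the ten sample values listed at the start of this section and prints the quartiles and the two fences along the way.

```python
import numpy as np

def outliers_iqr(data):
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    lower_bound = q1 - (iqr * 1.5)
    upper_bound = q3 + (iqr * 1.5)
    return np.where((data > upper_bound) | (data < lower_bound))

values = np.array([234, 267, 1, 200, 245, 300, 199, 250, 8999, 245])

q1, q3 = np.percentile(values, [25, 75])
print(q1, q3)                          # 208.5 262.75
print(q1 - 1.5 * (q3 - q1))            # 127.125 (lower fence)
print(q3 + 1.5 * (q3 - q1))            # 344.125 (upper fence)

positions = outliers_iqr(values)[0]
print(positions.tolist())              # [2, 8]
print(values[positions].tolist())      # [1, 8999]
```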

To test the Tukey Fences, let’s use the famous Galton dataset on the heights
of parents and their children. The dataset contains data based on the famous
1885 study of Francis Galton exploring the relationship between the heights of
adult children and the heights of their parents. Each case is an adult child, and
the variables are as follows:
- Family: The family that the child belongs to, labeled by the numbers from 1 to 204 and 136A
- Father: The father’s height, in inches
- Mother: The mother’s height, in inches
- Gender: The gender of the child, male (M) or female (F)
- Height: The height of the child, in inches
- Kids: The number of kids in the family of the child

The dataset has 898 cases.

In [None]:
import pandas as pd
df = pd.read_csv("http://www.mosaic-web.org/go/datasets/galton.csv")
print(df.head())


In [None]:
print("Outliers using outliers_iqr()")
print("=============================")
for i in outliers_iqr(df.height)[0]:
    print(df[i:i+1])


### Z-Score

The second method for determining outliers is to use the Z-score method. A
Z-score indicates how many standard deviations a data point is from the mean.
The Z-score has the following formula: 

$$Z =\frac{(x_i - \mu)}{\sigma}$$

where $x_i$ is the data point, $\mu$ is the mean of the dataset, and $\sigma$ is the standard
deviation.

This is how you interpret the Z-score:
- A negative Z-score indicates that the data point is less than the mean, and
a positive Z-score indicates the data point in question is larger than
the mean
- A Z-score of 0 tells you that the data point is right in the middle (mean),
and a Z-score of 1 tells you that your data point is 1 standard deviation
above the mean, and so on
- Any data point with a Z-score greater than 3 or less than –3 is considered to be an outlier

In [None]:
def outliers_z_score(data):
    threshold = 3
    mean = np.mean(data)
    std = np.std(data)
    z_scores = [(y - mean) / std for y in data]
    return np.where(np.abs(z_scores) > threshold)


In [None]:
print("Outliers using outliers_z_score()")
print("=================================")
for i in outliers_z_score(df.height)[0]:
    print(df[i:i+1])
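
It is worth trying the Z-score method on the ten sample values from the start of this section as well. In the sketch below, which uses a vectorized form of the same calculation, nothing is flagged: the value 8999 inflates both the mean and the standard deviation, and with the population standard deviation that `np.std()` calculates by default, no Z-score in a set of n values can exceed the square root of n - 1, which is 3 for ten values.

```python
import numpy as np

values = np.array([234, 267, 1, 200, 245, 300, 199, 250, 8999, 245])

# vectorized Z-scores: NumPy subtracts and divides element by element
z_scores = (values - values.mean()) / values.std()

print(np.round(z_scores, 2))
print(np.abs(z_scores).max())                # just under 3
print(np.where(np.abs(z_scores) > 3)[0])     # empty: nothing is flagged
```

By contrast, the Tukey Fences rule flags both 1 and 8999 in the same ten values, so the Z-score method is better suited to larger datasets such as the Galton data.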
