This notebook is a very basic introduction to the NumPy, Pandas, and Scikit-Learn libraries, which form the basis of most data analysis in Python.

To learn more about these libraries, try the official tutorials for each one:

*   [NumPy basics](https://numpy.org/devdocs/user/quickstart.html)
*   [Pandas in 10 minutes](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html)
*   [Scikit-learn official tutorial](https://scikit-learn.org/stable/tutorial/index.html)



# NumPy

NumPy is a library for handling *multidimensional arrays* (a 2D array, for example, is a matrix).

This, in turn, lets us do things like linear algebra, which is a key part of most machine learning and scientific computing.

## What is an array/matrix?

In [None]:
# Numbers in Python:
n = 5

n * n

In [None]:
# A Python list:
l = [1,2,3]

l

In [None]:
# A Python list of lists:
lol = [
       [1,2,3],
       [4,5,6],
       [7,8,9]
]

lol

However, we cannot do much math with lists of numbers (or lists of lists of numbers).

#### Exercise

For example, try adding or multiplying `l` or `lol` by themselves:

```python
l + l

l * l
```

## Creating arrays in NumPy

We need to import the `numpy` library. We usually abbreviate it as `np` for convenience.

In [None]:
import numpy as np

In [None]:
# A 1 dimensional array:

array_1d = np.array(l)

array_1d

In [None]:
array_1d * array_1d

In [None]:
# A 2 dimensional array

array_2d = np.array(lol)

array_2d

In [None]:
array_2d * array_2d
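
Note that `*` on NumPy arrays multiplies matching entries (element-wise). Since linear algebra was mentioned earlier, here is a minimal sketch contrasting it with the matrix product, which uses the `@` operator. The array `a` is just a stand-in with the same values as `array_2d`.

```python
import numpy as np

# Stand-in with the same values as array_2d
a = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

# `*` multiplies entry by entry, so every value is simply squared
elementwise = a * a

# `@` is the matrix product: rows of the left array times columns of the right
matrix_product = a @ a

print(elementwise)
print(matrix_product)
```

The two results have the same shape here, but different values, so it is worth checking which operation you actually need.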

## Creating arrays of ones, zeros, and random numbers

In [None]:
# An array of 1's

np.ones((2,4))

In [None]:
# An array of 0's

np.zeros((4,2))

In [None]:
# An array of random numbers

np.random.random((3,3))

In [None]:
# Or if we want just integers

np.random.randint(0,10, size = (2,3))
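
Random arrays change every time the cell is run. If you need the same numbers on every run, one option is a seeded generator created with `np.random.default_rng`. This is a small sketch, not part of the original notebook; the seed value 42 is an arbitrary choice.

```python
import numpy as np

# A generator with a fixed seed produces the same numbers on every run
rng = np.random.default_rng(seed=42)

# Random floats in the interval [0, 1), in a 3x3 array
floats = rng.random((3, 3))

# Random integers from 0 up to (but not including) 10, in a 2x3 array
ints = rng.integers(0, 10, size=(2, 3))

print(floats)
print(ints)
```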

## Indexing into NumPy arrays

We index (get a specific value) using the square bracket notation:

`array_2d[row_index, column_index]`

Don't forget that Python is *zero-indexed*, i.e. we refer to the first row/column as 0.

In [None]:
print(array_2d)

print(array_2d[2,2])

#### Exercise

How would you retrieve the 3 from `array_2d`?

In [None]:
# Your answer here

## Slicing NumPy arrays

We can also select a subset of an array by slicing, which also uses square brackets, together with colons (`:`). A slice `start:stop` includes `start` but excludes `stop`.

In [None]:
array_2d[0:2,2]

In [None]:
array_2d[0:2,1:]
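
To make the two slices above easier to read, here is a short sketch that spells out which rows and columns each one selects. The array `a` is a stand-in with the same values as `array_2d`.

```python
import numpy as np

# Stand-in with the same values as array_2d
a = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

# Rows 0 and 1 (the stop index 2 is excluded), column 2 only
column_part = a[0:2, 2]

# Rows 0 and 1, columns 1 through the end
block = a[0:2, 1:]

# A bare colon selects everything along that axis: here, all of row 1
whole_row = a[1, :]

print(column_part, column_part.shape)
print(block, block.shape)
print(whole_row)
```

Using a single integer for an axis drops that axis, which is why `column_part` is 1-dimensional while `block` is still 2-dimensional.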

# Pandas

Pandas is a library that allows us to store data in tables called **dataframes**.

How is this different from a 2D array in NumPy?

*   A Pandas dataframe allows us to have things like column names
*   A NumPy array needs to have the same type of data in the entire array (e.g. just integers, or just decimals). However a Pandas dataframe can have a different type of data in each column.
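
The difference in data types can be seen directly. The sketch below uses made-up data (the `people` table and its column names are hypothetical) and imports both libraries so that it runs on its own.

```python
import numpy as np
import pandas as pd

# A NumPy array has one data type, so the integers are converted to floats
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)

# A dataframe keeps a separate data type for each column
# (hypothetical example data)
people = pd.DataFrame({
    "name": ["Ada", "Grace"],      # text
    "age": [36, 45],               # integers
    "height_m": [1.65, 1.70],      # decimals
})
print(people.dtypes)
```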

By convention, we import the `pandas` library under the abbreviation `pd`.

In [None]:
import pandas as pd

In [None]:
# Create a dataframe from a list of lists (a NumPy array works the same way)

pd.DataFrame(lol)

In [None]:
# We can specify column names:

df = pd.DataFrame(lol, columns = ["A", "B", "C"])

df

In [None]:
# Dataframes can be converted back to numpy arrays
df.to_numpy()

Once we have a dataframe, we can do a lot of useful things with it:

In [None]:
df.describe()

In [None]:
df.plot(x = "A", y = "B", kind = "scatter")

## Indexing and slicing Pandas dataframes

There are two ways to select data from a Pandas dataframe:

* by index (i.e. the number of the row and column)

* by label (i.e. the name of the row and column)

In [None]:
# Label based selection
# Note that we need the row first, then the column
df.loc[:,"A"]

In [None]:
# Index based selection
df.iloc[:,0]
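
With the default row labels (0, 1, 2, ...), `loc` and `iloc` can look interchangeable. The sketch below gives the rows text labels to make the difference visible. The dataframe `df_labeled` and its row labels `"x"`, `"y"`, `"z"` are made up for illustration.

```python
import pandas as pd

# Same values as df, but with hypothetical text labels for the rows
df_labeled = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    columns=["A", "B", "C"],
    index=["x", "y", "z"],
)

# Label based: row named "y", column named "B"
by_label = df_labeled.loc["y", "B"]

# Index based: second row, second column (counting from 0)
by_position = df_labeled.iloc[1, 1]

# Label slices include the end label, so this returns rows "x" and "y"
label_slice = df_labeled.loc["x":"y", "A"]

# Index slices exclude the stop position, so this returns only the first row
position_slice = df_labeled.iloc[0:1, 0]

print(by_label, by_position)
print(label_slice)
print(position_slice)
```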

We can also get one or more columns more simply:

In [None]:
df["A"]

In [None]:
df[["A", "C"]]

We can also do things like fill in missing data:

In [None]:
missing = pd.DataFrame(
    {"x1": [4,8,np.nan, 5],
     "x2": [0.543, np.nan, 0.213, 0.65]}
)

missing

In [None]:
# Replace missing values with a particular value, e.g. 0
missing.fillna(0)

In [None]:
# Fill each column's missing values with that column's median

missing.fillna(missing.median())

#### Exercises

i. What is the result of running `missing.median()` by itself?

ii. How do you think you could fill the missing values in `missing` with the mean instead of the median?

# Scikit-learn

Scikit-learn is *the* Python library for doing machine learning.

It includes a lot of tools for processing data, as well as all the most common models that we want to use for making predictions.

Most of the tools in `sklearn` are classes, and they implement a similar set of methods.

For example, the classes for models (e.g. linear regression or random forests) are referred to as *estimators*: they have a `fit()` method, and those that make predictions also have a `predict()` method.

In [None]:
# Here we import the built-in iris dataset.
from sklearn import datasets
iris = datasets.load_iris()

iris.keys()

In [None]:
# The target (response variable, i.e. y) is in the target attribute: iris.target

# The features (explanatory variables) are in the data attribute:
iris.data

In [None]:
# Then we import a classification model, in this case a random forest.
from sklearn.ensemble import RandomForestClassifier

# First we have to create an instance of the class.
# At this point we could set parameters like the number of trees in the forest.
rf = RandomForestClassifier()

# Then we train the random forest using the fit() method (here on the full dataset)
rf.fit(iris.data, iris.target)

In [None]:
# Information about the fitted model is now stored in rf
print(iris.feature_names)
print(rf.feature_importances_)
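
The two printed lists are easier to read side by side. A minimal sketch, assuming the same iris data and a freshly fitted forest (the `random_state=0` argument is an addition here, used only to make the run repeatable):

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier

iris = datasets.load_iris()

# random_state is set only so that the result is the same on every run
rf = RandomForestClassifier(random_state=0)
rf.fit(iris.data, iris.target)

# Pair each feature name with its importance and sort, largest first
ranked = sorted(
    zip(iris.feature_names, rf.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

The importances are normalized, so they add up to 1 and can be read as shares of the total.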

In [None]:
# We can make predictions using the fitted model:

rf.predict(iris.data)

## Dividing our data into training and test sets

`sklearn` has a convenient function to randomly divide the data for us: `train_test_split`

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size = 0.2)
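
Because the split is random, the rows that end up in each set change from run to run. The sketch below is one possible variation, not the notebook's own cell: it adds `random_state` to make the split repeatable and `stratify` to keep the three iris classes balanced in both sets.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

X_train, X_test, y_train, y_test = train_test_split(
    iris.data,
    iris.target,
    test_size=0.2,         # 20% of the rows go to the test set
    random_state=0,        # arbitrary seed, so the split is repeatable
    stratify=iris.target,  # keep class proportions the same in both sets
)

# 150 rows in total: 120 for training, 30 for testing
print(X_train.shape, X_test.shape)

# Number of test rows in each of the three classes
print(np.bincount(y_test))
```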

## Cross validating models to find the optimum hyperparameter settings

Hyperparameters are parameters of the algorithm that the model cannot learn by itself.

For example, a random forest can learn a good set of trees from the data, but it cannot determine *how many* trees to create.

The number of trees is a hyperparameter. How do we find the optimal number of trees?

In [None]:
# Option 1
# Run cross-validation yourself with the cross_validate function

from sklearn.model_selection import cross_validate

rf = RandomForestClassifier(n_estimators=5)
scores = cross_validate(rf, X_train, y_train, cv = 5)
print("Scores:", scores)
print("Average CV score: ", scores['test_score'].mean())

In [None]:
# ... and then try with a different number of trees

rf = RandomForestClassifier(n_estimators=50)
scores = cross_validate(rf, X_train, y_train, cv = 5)

print("Scores:", scores)
print("Average CV score: ", scores['test_score'].mean())
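
Repeating the cell by hand for each setting gets tedious. The same comparison can be written as a loop. This is a sketch under a few assumptions: it uses `cross_val_score`, which returns only the test scores, and the candidate values and `random_state=0` are arbitrary choices.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0
)

# Average 5-fold cross-validation score for each candidate number of trees
mean_scores = {}
for n_trees in [5, 20, 50]:
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    scores = cross_val_score(rf, X_train, y_train, cv=5)
    mean_scores[n_trees] = scores.mean()
    print(n_trees, "trees:", mean_scores[n_trees])

# The setting with the highest average score
best_n = max(mean_scores, key=mean_scores.get)
print("Best number of trees:", best_n)
```

Only the training data is used here, so the test set stays untouched for the final check.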

Alternatively, we can automatically search for the best values of one or more hyperparameters:

In [None]:
# Option 2
# Automatically search for the best values of hyperparameters

from sklearn.model_selection import GridSearchCV

rf = RandomForestClassifier()
parameter_grid = {"n_estimators": [1,3,5,10,20], "max_depth": [1, 2, 3, None]}

grid_search = GridSearchCV(rf, parameter_grid, cv = 5)

grid_search.fit(X_train, y_train)

print("Best parameters:", grid_search.best_params_)
print("Best cross-validation score:", grid_search.best_score_)

Finally, once we have our best combination of hyperparameters, we want to make predictions on the test set and see how well we do:

In [None]:
# grid_search stores the best model (refit on all of the training data) in best_estimator_
# Its score method makes predictions on the test set and returns their accuracy.

grid_search.best_estimator_.score(X_test, y_test)
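
For a classifier, `score` reports accuracy: the fraction of test rows predicted correctly. The sketch below checks this by computing the accuracy in two steps with `predict` and `accuracy_score`. It rebuilds a small grid search so that it runs on its own; the smaller grid and `random_state=0` are simplifications, not the settings used above.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0
)

# A deliberately small grid, to keep the example quick
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [5, 10]},
    cv=5,
)
grid_search.fit(X_train, y_train)
best_model = grid_search.best_estimator_

# Step 1: predict the class of each test row
predictions = best_model.predict(X_test)

# Step 2: compare the predictions with the true classes
manual_accuracy = accuracy_score(y_test, predictions)

# The score method does both steps at once
score_accuracy = best_model.score(X_test, y_test)

print(manual_accuracy, score_accuracy)
```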

#### Exercises

Try some other classification models on the `iris` data. Can you do better than the Random Forest model?

Here are several other classification algorithms, and their important hyperparameters that you can tune:

* [K-Nearest Neighbors (KNN)](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier)
    * The number of neighbors (`n_neighbors`) is a hyperparameter you will need to tune
* [Logistic regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression)
    * Try varying the C hyperparameter and the `penalty` hyperparameter for different strengths and types of regularization respectively.
* [Support Vector Machine (SVM)](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC)
    * Vary the C parameter to change the strength of regularization.