# Python for n00bs

## Workshop 3: Python for Data Science
Welcome to the last workshop of the series! My name is Vikram Mark Radhakrishnan. You can find me on [LinkedIn](https://www.linkedin.com/in/vikram-mark-radhakrishnan-90038660/), or reach me via email at radhakrishnan@strw.leidenuniv.nl

Shout-out to the [AI Lab One](https://www.meetup.com/AI-Lab/) and [City AI](https://city.ai/) for making this workshop possible!
<img src="nb_images/AI_Lab.png">  
In today's workshop, we are going to take a look at several Python libraries that are essential for every data scientist. These libraries facilitate cleaning and pre-processing data, exploratory data analysis and visualization, and building, training, validating, and testing simple machine learning models.

### 1. NumPy - building a strong foundation
The [NumPy](https://www.numpy.org/) library is ubiquitous in data science, as well as a whole bunch of other fields. It enables powerful computing with N-dimensional arrays, and faster calculations thanks to storage-efficient data structures.

In [None]:
import numpy as np

Let's start with creating a bunch of numpy objects. These objects are called arrays, and can have 0 or more dimensions.

In [None]:
scalar = np.array(42) # A zero-dimensional array is called a scalar
vector = np.array([4, 8, 15, 16, 23, 42]) # A one-dimensional array is called a vector
matrix = np.array([[2, 7, 6], [9, 5, 1], [4, 3, 8]]) # A two-dimensional array is called a matrix
tensor = np.array([[[8, 24, 10], [12, 7, 23], [22, 11, 9]],
                   [[15, 1, 26], [25, 14, 3], [2, 27, 13]],
                   [[19, 17, 6], [5, 21, 16], [18, 4, 20]]]) # An array with three or more dimensions is often called a tensor

Useful to know: You can use the Python operator @ to take the dot product of two matrices, or you can use np.dot(M1, M2). To take the elementwise product, use np.multiply. And to take the transpose of a matrix you can use M.T or np.transpose(M).
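
As a minimal sketch of the operations just mentioned, the two small matrices below (`M1` and `M2`, values chosen only for illustration) are multiplied as matrices, multiplied elementwise, and transposed:

```python
import numpy as np

# Two small illustrative matrices (hypothetical values)
M1 = np.array([[1, 2], [3, 4]])
M2 = np.array([[0, 1], [1, 0]])

matrix_product = M1 @ M2            # same result as np.dot(M1, M2)
elementwise = np.multiply(M1, M2)   # same result as M1 * M2
transposed = M1.T                   # same result as np.transpose(M1)

print(matrix_product)
print(elementwise)
print(transposed)
```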

**ToDo:** Write a function to check if the transpose of a numpy matrix is equal to its inverse, i.e. A.A^t = I

More useful numpy methods:

In [None]:
ones_matrix = np.ones([2,2])
zeros_matrix = np.zeros([2,2])
empty_matrix = np.empty([2,2])
full_matrix = np.full((2,2), 7)
identity_matrix = np.eye(2)
random_matrix = np.random.random((2,2))

range_of_values = np.arange(1.3, 4.7, 0.2)

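Basic indexing and slicing of numpy arrays works just like indexing Python lists. In addition, numpy arrays accept one index per dimension (for example matrix[0, 1]) and Boolean masks.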
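
A small sketch of indexing, reusing the values of `vector` and `matrix` from earlier in this section: single indices and slices behave as they do for lists, and numpy additionally accepts one index per dimension and Boolean masks. The variable names on the left are only illustrative.

```python
import numpy as np

vector = np.array([4, 8, 15, 16, 23, 42])
matrix = np.array([[2, 7, 6], [9, 5, 1], [4, 3, 8]])

first = vector[0]             # like a list: the first element
last_two = vector[-2:]        # like a list: a slice of the last two elements
center = matrix[1, 1]         # row 1, column 1 in a single index expression
first_column = matrix[:, 0]   # every row, column 0
large = vector[vector > 15]   # Boolean mask: keep elements greater than 15

print(first, last_two, center, first_column, large)
```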

### 2. Visualizing data with Matplotlib
The [matplotlib.pyplot](https://matplotlib.org) library is one of the most widely used plotting and visualization tools for Python. With it you can plot figures and graphs that look very similar to those generated with MATLAB, and that are aesthetically appealing and easy to read.

The code in this section can also be found in this [official tutorial](https://matplotlib.org/users/pyplot_tutorial.html).

In [None]:
import matplotlib.pyplot as plt

Let's take a look at some simple 2d plots with plt.

In [None]:
#Just a line through some numbers
plt.plot([1,2,3,4])
plt.ylabel('some numbers')
plt.show()

In [None]:
# The x and y values of some coordinates, denoted with red dots
plt.plot([1,2,3,4], [1,4,9,16], 'ro')
plt.axis([0, 6, 0, 20])
plt.show()

In [None]:
# evenly sampled time at 200ms intervals
t = np.arange(0., 5., 0.2)

# red dashes, blue squares and green triangles
plt.plot(t, t, 'r--', t, t**2, 'bs', t, t**3, 'g^')
plt.show()

In [None]:
# Change the width of the line
x = np.array([1,2,3,4])
y = np.array([1,4,9,16])
plt.plot(x, y, lw=2.0)
plt.plot(x, y**2, c='g')

When making multiple plots in the same figure, we need to use the plt.subplot feature. The parameter passed to plt.subplot is three digits. The first digit is the number of rows of subplots, the second digit is the number of columns of subplots, and the third digit is the index of the subplot to make current, counting from 1 across the rows. So plt.subplot(211) selects the first subplot in a grid of 2 rows and 1 column.

In [None]:
# Let's make a figure with multiple axes
def f(t):
    return np.exp(-t) * np.cos(2*np.pi*t)

t1 = np.arange(0.0, 5.0, 0.1)
t2 = np.arange(0.0, 5.0, 0.02)

plt.figure(1)
plt.subplot(211)
plt.plot(t1, f(t1), 'bo', t2, f(t2), 'k')

plt.subplot(212)
plt.plot(t2, np.cos(2*np.pi*t2), 'r--')
plt.show()

In [None]:
# Let's make multiple figures with multiple axes
plt.figure(1)                # the first figure
plt.subplot(211)             # the first subplot in the first figure
plt.plot([1, 2, 3])
plt.subplot(212)             # the second subplot in the first figure
plt.plot([4, 5, 6])


plt.figure(2)                # a second figure
plt.plot([4, 5, 6])          # creates a subplot(111) by default

plt.figure(1)                # figure 1 current; subplot(212) still current
plt.subplot(211)             # make subplot(211) in figure1 current
plt.title('Easy as 1, 2, 3') # subplot 211 title

You can clear the current figure using plt.clf() and the current axis using plt.cla().

In [None]:
# Adding text to the plots

np.random.seed(19680801)

mu, sigma = 100, 15
x = mu + sigma * np.random.randn(10000)

# the histogram of the data
n, bins, patches = plt.hist(x, 50, density=True, facecolor='g', alpha=0.75)


plt.xlabel('Smarts')
plt.ylabel('Probability')
plt.title('Histogram of IQ')
plt.text(60, .025, r'$\mu=100,\ \sigma=15$')
plt.axis([40, 160, 0, 0.03])
plt.grid(True)
plt.title(r'$\sigma_i=15$')
plt.show()

In [None]:
# Here's how we annotate with text
ax = plt.subplot(111)

t = np.arange(0.0, 5.0, 0.01)
s = np.cos(2*np.pi*t)
line, = plt.plot(t, s, lw=2)

plt.annotate('local max', xy=(2, 1), xytext=(3, 1.5),
            arrowprops=dict(facecolor='black', shrink=0.05),
            )

plt.ylim(-2,2)
plt.show()

You can change the scaling of the axes using plt.xscale and plt.yscale. You can choose among linear, logarithmic, or logit scales.
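
As a rough sketch of what changing the scale does, the example below plots the same exponentially growing data twice, once with the default linear y-axis and once with plt.yscale('log'). The data and the names `ax_linear` and `ax_log` are made up for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(1, 11)
y = 2.0 ** x                    # grows exponentially

ax_linear = plt.subplot(121)    # left: default linear y-axis
plt.plot(x, y)
plt.title('linear')

ax_log = plt.subplot(122)       # right: same data, logarithmic y-axis
plt.plot(x, y)
plt.yscale('log')
plt.title('log')

plt.show()
```

On the logarithmic axis the exponential curve appears as a straight line, which is one common reason to switch scales.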

**ToDo:** Make a list of people's heights and weights, plot them, and label each point with the person's name.

In [None]:
names = []
heights = []
weights = []

plt.plot()
for person in range(len(names)):
    plt.text()
plt.show()

### 3. Tabulating data with pandas
The [pandas](https://pandas.pydata.org) library provides powerful data structures for efficient handling of tabulated or labeled data. It has built-in methods to deal with missing data, and it differentiates between numeric data, categorical data, and dates/timestamps.

The code in this section can also be found in this [tutorial](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html#min).

In [None]:
import pandas as pd

In [None]:
# Create a one-dimensional pandas Series (np.nan marks a missing value)
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s

In [None]:
# Create a dataframe using a numpy array, with a datetime index and labeled columns
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df

In [None]:
# Or use a dict of objects that can be converted into a series
df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})
df2

In [None]:
df2.dtypes

The head and tail methods are useful for viewing small, relevant subsets of your dataframe.  
The index and columns attributes let you separate the row labels (here, the dates) and the column labels from the actual data.  
The describe method shows a statistical summary of your data.  
You can also use descriptive statistics such as df.mean() or df.median() for a column-by-column calculation.

You can use the to_numpy() function to convert a pandas dataframe into a numpy 2-dimensional array.
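
The inspection tools described above can be tried out with a sketch like the following. It builds a small dataframe shaped like the one from the earlier cell, under the hypothetical name `demo` so that the notebook's own `df` is left untouched.

```python
import numpy as np
import pandas as pd

# A 6x4 dataframe with a date index, like the earlier example
dates = pd.date_range('20130101', periods=6)
demo = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

print(demo.head(2))       # first two rows
print(demo.tail(2))       # last two rows
print(demo.index)         # the row labels (dates here)
print(demo.columns)       # the column labels
print(demo.describe())    # count, mean, std, min, quartiles, max per column
print(demo.mean())        # column-by-column mean

array = demo.to_numpy()   # plain 2-dimensional numpy array, labels dropped
print(array.shape)
```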

You can sort your data by a particular axis, or by a particular column's values.

In [None]:
df.sort_index(axis=1, ascending=False)

In [None]:
df.sort_values(by='B')

Pandas dataframes are sort of similar to Python dictionaries, in the way that you can select and refer to data. Here are some ways of selecting data from a pd dataframe:

In [None]:
# For getting a cross section using a label
df.loc[dates[0]]

In [None]:
# Selecting on a multi-axis by label:
df.loc[:, ['A', 'B']]

In [None]:
# Label slicing
df.loc['20130102':'20130104', ['A', 'B']]

In [None]:
# Selecting values from a dataframe by their positions
df.iloc[3:5, 0:2]

In [None]:
# Or by explicitly mentioning indexes
df.iloc[[1, 2, 4], [0, 2]]

In [None]:
# Or using Boolean conditions to slice/subset a dataframe
df[df.A > 0]

In [None]:
# Setting a new column
E = np.arange(6)
df['E'] = E
df

Pandas offers many elegant ways to deal with missing data. You can drop missing data using df.dropna(how=...) or you can fill in missing data with df.fillna(value=...)
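
Here is a minimal sketch of both approaches on a tiny dataframe with a few missing values. The name `demo` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                     'B': [np.nan, np.nan, 6.0]})

dropped_any = demo.dropna(how='any')   # drop rows containing at least one NaN
dropped_all = demo.dropna(how='all')   # drop only rows where every value is NaN
filled = demo.fillna(value=0)          # replace every NaN with 0

print(dropped_any)
print(dropped_all)
print(filled)
```

Whether dropping or filling is appropriate depends on the dataset; filling with 0 is only one of many possible choices.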

In [None]:
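# Subtract a series from every column, aligning on the index;
# shift(2) moves the values down two rows and leaves NaN at the top
s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)
df.sub(s, axis='index')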

In [None]:
# Breaking a dataframe into pieces and merging it with concat
df = pd.DataFrame(np.random.randn(10, 4))
pieces = [df[:3], df[3:7], df[7:]]
pd.concat(pieces)

In [None]:
# Appending data to a dataframe
df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
s = df.iloc[3]
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
pd.concat([df, s.to_frame().T], ignore_index=True)

Now let's take a look at grouping data. This is a blanket term to cover
* Splitting the data into groups based on some criteria
* Applying a function to each group independently
* Combining the results into a data structure

In [None]:
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})

df.groupby(['A', 'B']).sum()

Pandas provides ways to efficiently handle categorical data. Let's take a look:

In [None]:
df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],
                   "raw_grade": ['a', 'b', 'b', 'a', 'a', 'e']})
df["grade"] = df["raw_grade"].astype("category")
df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"])

df

In [None]:
df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])

You can save dataframes in csv files using the df.to_csv(...) method.
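
As a small sketch of saving and reloading, the example below writes a dataframe as CSV into an in-memory buffer instead of a file, then reads it back. The buffer and the name `demo` are assumptions made to keep the example self-contained; passing a filename to to_csv works the same way.

```python
import io
import pandas as pd

demo = pd.DataFrame({'id': [1, 2, 3], 'grade': ['a', 'b', 'e']})

buffer = io.StringIO()
demo.to_csv(buffer, index=False)   # index=False leaves out the row labels
buffer.seek(0)                     # rewind before reading
restored = pd.read_csv(buffer)

print(restored)
```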

We can read csv files directly from URLs with pandas. For example, let's take a look at some [housing data from Amsterdam](https://data.amsterdam.nl) with pandas.

In [None]:
url = 'https://api.data.amsterdam.nl/dataselectie/bag/export/?openbare_ruimte=Kalverstraat'
df1 = pd.read_csv(url, sep=";")

**ToDo**: Experiment with your own datasets.

### 4. Learning from data with scikit-learn
The [scikit-learn](https://scikit-learn.org) library is a standard machine learning resource for data scientists who program in Python. It includes methods to partition data into training, test, and validation sets, as well as many machine learning models.

The code in this section can also be found in this [tutorial](https://www.datacamp.com/community/tutorials/machine-learning-python).

To understand sklearn, we will go through an example of loading a dataset, doing some exploratory analysis on it, and then training a machine learning model on it to function as a predictor.

Our first task is to load the dataset. We will use one of the example datasets available to us in the scikit-learn library, called "digits". The digits dataset is a collection of handwritten digits, each image being 8 pixels by 8 pixels. When loaded as a dataset, each row of the dataset represents one image, and each of the 64 columns is the value of one pixel in this image.

In [None]:
from sklearn import datasets

#Load the digits dataset
digits = datasets.load_digits()

#Display the last digit in the dataset
plt.figure(1, figsize=(3, 3))
plt.imshow(digits.images[-1], cmap=plt.cm.gray_r, interpolation='nearest')
plt.show()

We will now do some preliminary exploratory analysis on our data. This will give us an idea of what our data looks like and how we should go about building a classifier with this data.

In [None]:
# Get the keys of the `digits` data
print(digits.keys())

# Print out the data
print(digits.data)

# Print out the target values
print(digits.target)

# Print out the description of the `digits` data
print(digits.DESCR)

Now let's look a bit more into the content of our dataset. The shapes of our training data and labels (targets in this case) tell us how much data we have to work with, and what its dimensions are, and how many labels or categories it falls into.

In [None]:
# Isolate the `digits` data
digits_data = digits.data

# Inspect the shape
print(digits_data.shape)

# Isolate the target values with `target`
digits_target = digits.target

# Inspect the shape
print(digits_target.shape)

# Print the number of unique labels
number_digits = len(np.unique(digits.target))
print(number_digits)

# Isolate the `images`
digits_images = digits.images

# Inspect the shape
print(digits_images.shape)

Of course, there is no better way to visualize data than by plotting it out!

In [None]:
# Figure size (width, height) in inches
fig = plt.figure(figsize=(6, 6))

# Adjust the subplots 
fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)

# For each of the first 64 images
for i in range(64):
    # Initialize the subplots: add a subplot in the grid of 8 by 8, at the i+1-th position
    ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[])
    # Display an image at the i-th position
    ax.imshow(digits.images[i], cmap=plt.cm.binary, interpolation='nearest')
    # label the image with the target value
    ax.text(0, 7, str(digits.target[i]))

# Show the plot
plt.show()

We are working with data that has several features (here, 64 pixel values per image). In general, features could be the amount of light/dark pixels, number of straight lines, number of curved lines, number of loops, etc. However, working with so many different features is difficult for a machine learning algorithm, and often unnecessary. This is because some features carry little information beyond what other features already provide, so they can be combined into a smaller set of new features to reduce dimensionality. Principal Component Analysis, or PCA, attempts to do just this.

In [None]:
from sklearn.decomposition import PCA

# Create a regular PCA model 
pca = PCA(n_components=2)

# Fit and transform the data to the model
reduced_data_pca = pca.fit_transform(digits.data)

# Inspect the shape
reduced_data_pca.shape
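
To get a feel for how much information the two components keep, one option is to look at the `explained_variance_ratio_` attribute of a fitted PCA model. This is a side check rather than part of the original walkthrough, and the name `pca_demo` is chosen so that the `pca` object above is not overwritten.

```python
from sklearn import datasets
from sklearn.decomposition import PCA

digits = datasets.load_digits()

# Fit a separate two-component PCA on the same 64-pixel data
pca_demo = PCA(n_components=2)
pca_demo.fit(digits.data)

# Fraction of the total variance captured by each component
ratios = pca_demo.explained_variance_ratio_
print(ratios)

# Fraction captured by both components together
print(ratios.sum())
```

A total well below 1 means the 2-D scatter plot is a lossy summary of the 64-dimensional data, which is worth keeping in mind when reading it.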

In [None]:
colors = ['black', 'blue', 'purple', 'yellow', 'white', 'red', 'lime', 'cyan', 'orange', 'gray']
for i in range(len(colors)):
    x = reduced_data_pca[:, 0][digits.target == i]
    y = reduced_data_pca[:, 1][digits.target == i]
    plt.scatter(x, y, c=colors[i])
plt.legend(digits.target_names, bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.title("PCA Scatter Plot")
plt.show()

Before we feed our data to a machine learning model, we need to pre-process it. Raw data is almost never ready to be used by a machine learning algorithm, and data preprocessing is usually the longest and most intensive part of a data scientist's job.

In [None]:
from sklearn.preprocessing import scale

# Apply `scale()` to the `digits` data
# This gives the data a mean of 0 and a variance of 1
data = scale(digits.data)
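
A quick way to check the effect of scaling is to look at the per-column mean and standard deviation afterwards. The sketch below repeats the scaling step so that it runs on its own; `column_means` and `column_stds` are illustrative names.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import scale

digits = datasets.load_digits()
data = scale(digits.data)

column_means = data.mean(axis=0)   # one mean per pixel column
column_stds = data.std(axis=0)     # one standard deviation per pixel column

# Every column should now be centered on zero
print(np.allclose(column_means, 0))

# Distinct standard deviations found across the columns
print(np.unique(np.round(column_stds, 6)))
```

A column whose values are all identical cannot be rescaled to unit variance, so a standard deviation of 0 can appear next to 1 in the second printout.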

Next we split our data into a training set and a testing set. With very large datasets it is also useful to have a validation set.

In [None]:
# Import `train_test_split`
from sklearn.model_selection import train_test_split

# Split the `digits` data into training and test sets
X_train, X_test, y_train, y_test, images_train, images_test = train_test_split(data, digits.target, digits.images, test_size=0.25, random_state=42)
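
For the validation set mentioned above, one possible sketch is to call train_test_split twice: first to hold out a test set, then to carve a validation set out of what remains. The proportions and the variable names (`X_fit`, `X_val`, `X_holdout`, and so on) are assumptions for illustration and are deliberately different from the names used in this notebook.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
X, y = digits.data, digits.target

# First split: hold out 20% of the data as the final test set
X_rest, X_holdout, y_rest, y_holdout = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Second split: take 25% of the remainder as the validation set
X_fit, X_val, y_fit, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42)

print(len(X_fit), len(X_val), len(X_holdout))
```

The rest of this notebook continues with the two-way split into `X_train` and `X_test` made above.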

Let's inspect the numbers after this split.

In [None]:
# Number of training features
n_samples, n_features = X_train.shape

# Print out `n_samples`
print(n_samples)

# Print out `n_features`
print(n_features)

# Number of Training labels
n_digits = len(np.unique(y_train))

# Inspect `y_train`
print(len(y_train))

Now, we use a clustering algorithm to build our classifier. We choose a clustering algorithm in this case because it makes the most sense given the kind of data we have - digits with certain features being more similar than others, hence more closely "clustered" in feature-space.

In [None]:
# Import the `cluster` module
from sklearn import cluster

# Create the KMeans model
clf = cluster.KMeans(init='k-means++', n_clusters=10, random_state=42)

# Fit the model to the training data `X_train`
clf.fit(X_train)

Let's visualize the cluster centers by plotting them out.

In [None]:
# Figure size in inches
fig = plt.figure(figsize=(8, 3))

# Add title
fig.suptitle('Cluster Center Images', fontsize=14, fontweight='bold')

# For all labels (0-9)
for i in range(10):
    # Initialize subplots in a grid of 2X5, at i+1th position
    ax = fig.add_subplot(2, 5, 1 + i)
    # Display images
    ax.imshow(clf.cluster_centers_[i].reshape((8, 8)), cmap=plt.cm.binary)
    # Don't show the axes
    plt.axis('off')

# Show the plot
plt.show()

And now we have a trained model! Let's try to visualize how well our classifier has fitted to the training data.

In [None]:
# Model and fit the `digits` data to the PCA model
X_pca = PCA(n_components=2).fit_transform(X_train)

# Compute cluster centers and predict cluster index for each sample
clusters = clf.fit_predict(X_train)

# Create a plot with subplots in a grid of 1X2
fig, ax = plt.subplots(1, 2, figsize=(8, 4))

# Adjust layout
fig.suptitle('Predicted Versus Training Labels', fontsize=14, fontweight='bold')
fig.subplots_adjust(top=0.85)

# Add scatterplots to the subplots 
ax[0].scatter(X_pca[:, 0], X_pca[:, 1], c=clusters)
ax[0].set_title('Predicted Training Labels')
ax[1].scatter(X_pca[:, 0], X_pca[:, 1], c=y_train)
ax[1].set_title('Actual Training Labels')

# Show the plots
plt.show()

Now let's use our model to predict on the test set, and see how well we do here.

In [None]:
# Import `metrics` from `sklearn`
from sklearn import metrics

# Predict the labels for `X_test`
y_pred=clf.predict(X_test)

# Print out the confusion matrix with `confusion_matrix()`
print(metrics.confusion_matrix(y_test, y_pred))

In [None]:
metrics.accuracy_score(y_test, y_pred)

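Not going to lie, this is pretty pathetic! One reason is that k-means is unsupervised: the cluster indices it assigns are arbitrary and need not match the digit labels, so a direct comparison with the true labels scores poorly even when the clusters themselves are sensible. Okay, let's try a different machine learning algorithm on this same data.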

In [None]:
# Import the `svm` model
from sklearn import svm

# Create the SVC model (gamma is ignored by the linear kernel)
svc_model = svm.SVC(gamma=0.001, C=100., kernel='linear')

# Fit the data to the SVC model
svc_model.fit(X_train, y_train)

svc_model.score(X_test, y_test)  

This is incredible! We have almost 100% accuracy using the Support Vector Machine model! Let's visualize this in a plot.

In [None]:
# Assign the predicted values to `predicted`
predicted = svc_model.predict(X_test)

# Zip together the `images_test` and `predicted` values in `images_and_predictions`
images_and_predictions = list(zip(images_test, predicted))

# For the first 4 elements in `images_and_predictions`
for index, (image, prediction) in enumerate(images_and_predictions[:4]):
    # Initialize subplots in a grid of 1 by 4 at position index + 1
    plt.subplot(1, 4, index + 1)
    # Don't show axes
    plt.axis('off')
    # Display images in all subplots in the grid
    plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    # Add a title to the plot
    plt.title('Predicted: ' + str(prediction))

# Show the plot
plt.show()

### 5. Some more useful libraries and your next steps!
This list of libraries is by no means exhaustive, but it is enough to get you started on your journey to being a data scientist. Other useful libraries to learn would be:
1. [sqlite3](https://docs.python.org/2/library/sqlite3.html): A library that lets you use SQL commands within Python to construct and query databases.
2. [ggplot](http://ggplot.yhathq.com/): A plotting and visualization library that R programmers are very familiar with.
3. [seaborn](https://seaborn.pydata.org/): A high-level plotting interface based on matplotlib, that allows you to make attractive statistical graphics with little code.
4. [TensorFlow](https://www.tensorflow.org): A deep learning library developed by Google. Complex syntax, but is useful for making machine learning models that can be deployed on various platforms and devices.
5. [PyTorch](https://pytorch.org): A deep learning library developed by Facebook. Easy syntax that is similar to numpy.
6. [Keras](https://keras.io/): A high level deep learning library built on top of, and within TensorFlow.
7. [FastAI](https://fast.ai): A high-level deep learning library built on top of PyTorch, which is a popular choice for winning competitions on Kaggle.

You have now come to the end of your three-week journey with Python. You have access to the three Python notebooks we worked with, and you can also use these resources to further your knowledge of Python programming and become more proficient.  
1. [Python for Everybody Specialization](https://www.coursera.org/learn/python): A series of online courses offered by the University of Michigan, which starts from the absolute basics of programming, and takes you to a high level of proficiency. It is comprehensive, yet easy to follow.
2. [Datacamp](https://www.datacamp.com/): A website with multiple online courses on various topics, including Python programming, fundamentals of data science, and machine learning.
3. [Kaggle](https://kaggle.com): A website where you can participate in several data science competitions, and improve your skills.

Thank you for being a part of these workshops! I wish you luck in your future endeavours.