# Building Blocks

### Step 1: Launch Jupyter

You made it here, so great job!

### Step 2: Getting Familiar with Jupyter Notebook
Jupyter Notebook is one of the most popular development environments for data scientists. As the name implies, it's meant to be a notebook of sorts where you can quickly jot down ideas, experiment, and see results. It's not typically used for production, although some projects aim to make notebooks usable in production environments.

---
#### HIGHLIGHT AND RUN THIS MARKDOWN CELL

A notebook contains individual cells, each of which holds either an executable block of code or Markdown-formatted text for notes. Each cell can be toggled between Code and Markdown from the menu.

This cell is set to Markdown and can be formatted with **bold**, *italic*, <font color="blue">colors</font>, etc. This allows you to create well-documented, live experiments and analyses. To render the Markdown, highlight the cell and click the "Run" button on the top menu, or press `SHIFT`+`RETURN` on your keyboard. Double-click anywhere in the cell to toggle back to edit mode.

For more information on Markdown, see https://www.markdownguide.org/cheat-sheet/

In [None]:
## SELECT AND RUN THIS CODE CELL
## This is a code cell used to create and execute blocks of Python code. To execute this code and view the output, select and run this cell.

# Use the print function to show output
print("hello world")

- You can change the code and rerun it to see the changed output.
- You can also clear a cell's output by highlighting the cell and choosing `Cell->Current Outputs->Clear` from the menu.

Go ahead and play around with the next few code cells to familiarize yourself with Jupyter.

In [None]:
# print can also show the result of a function or method call
foo = "hello"
print(" ".join([foo, "world"]))

# Not just for strings
print(1+3)

# The last line to be evaluated will also be shown in the output
f'The length of foo is {len(foo)}'

Variables set in a cell are available throughout the notebook, not just in the cell where they are created. However, cells do not have to be run from top to bottom or in any particular order, so a variable is only available if the cell that defines it has been run in the current session.

Run the cell below and you will see that `foo` can be used, but trying to access `bar` will raise a `NameError` because it has not been defined yet.

In [None]:
print(foo)
print(bar)

Now run the cell below, then rerun the one above, and you will see that `bar` is now available.

In [None]:
bar = "world"

This was a quick and relatively shallow overview of Jupyter Notebooks, but it should be enough to help you start navigating and getting hands-on with the fun stuff. To find out more, visit https://jupyter.org/

### Step 3: Install NumPy

If you are running this tutorial locally, you can use `pip install` from your command line. Installing packages directly from within a notebook is generally not a great idea, but we will do it for the purposes of this tutorial, especially since we are using mybinder.org, which provides a temporary notebook server generated from a notebook in a GitHub repository.

In [None]:
#RUN THIS CELL TO INSTALL NUMPY
import sys
!{sys.executable} -m pip install numpy

### Step 4: Using NumPy

NumPy is heavily used in Python for data science and machine learning projects. It's a highly efficient tool for creating and processing multi-dimensional arrays.
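As a minimal sketch of a simple array, the snippet below builds a one-dimensional NumPy array from a Python list and processes it without an explicit loop. The name `temps_c` and its values are made up for illustration and are not part of the tutorial data.

```python
import numpy as np

# Hypothetical sample values, not part of the tutorial data
temps_c = np.array([18.5, 21.0, 19.5, 23.0])

print(temps_c.dtype)   # element type shared by the whole array
print(temps_c.shape)   # (4,) -> one dimension with four elements

# Arithmetic is applied element-wise, so no explicit loop is needed
temps_f = temps_c * 9 / 5 + 32
print(temps_f)
print(temps_c.mean())
```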

In [None]:
#SIMPLE ARRAY CODE HERE


NumPy's strength lies in its ability to process large, multi-dimensional arrays.
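One possible illustration of a multi-dimensional array is sketched below. The 2 x 3 `grid` and its values are hypothetical. The snippet shows how shape, indexing, and per-axis operations work on two dimensions.

```python
import numpy as np

# Hypothetical grid with 2 rows and 3 columns
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

print(grid.shape)        # (2, 3)
print(grid.ndim)         # number of dimensions
print(grid[1, 2])        # row 1, column 2 (indices start at 0)
print(grid[:, 0])        # first column of every row
print(grid.sum(axis=0))  # totals per column
print(grid.T.shape)      # shape after swapping rows and columns
```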

In [None]:
#MULTI-DIMENSIONAL ARRAY CODE HERE


- NumPy arrays are optimized for memory usage, and their size is fixed when they are created. If you need to append an unknown number of items dynamically, a native Python structure such as a list may be a better fit.
- NumPy has many powerful mathematical functions that can be performed on arrays.
- Find more info at https://numpy.org/doc/stable/user/quickstart.html#the-basics
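The first two points can be sketched as follows, using hypothetical values. `np.append` returns a new array rather than growing the existing one, so collecting items in a list and converting once is often the simpler route. The mathematical functions then operate on the whole array at once.

```python
import numpy as np

values = np.array([1.0, 4.0, 9.0])

# np.append does not grow the array in place; it allocates a new one
extended = np.append(values, 16.0)
print(values.size, extended.size)  # the original array is unchanged

# A plain list grows in place, which suits an unknown number of items
collected = []
for item in (1.0, 4.0, 9.0, 16.0):
    collected.append(item)
as_array = np.array(collected)  # convert once, after collecting

# Mathematical functions operate on the whole array at once
print(np.sqrt(as_array))
print(as_array.max(), as_array.std())
```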

### Step 5: Using Pandas

Pandas is like a small, in-memory database for processing data. Internally, it uses NumPy but adds convenient querying and data access. One very common use is importing and exporting CSV files to and from a Pandas DataFrame. Pandas has something of a domain-specific language of its own, so keep the documentation handy: https://pandas.pydata.org/docs/reference/

The two main Pandas data structures are Series and DataFrame. A Series is a labeled one-dimensional array, and a DataFrame is a two-dimensional structure with labeled rows and columns. Each individual row and column of a DataFrame is a Series.
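The relationship between the two structures can be sketched with a few made-up values. The names `ages`, `cities`, and `people_df` and all of the data are hypothetical and unrelated to the tutorial's own dataset. The snippet assumes pandas is already installed.

```python
import pandas as pd

# Hypothetical sample data, unrelated to the tutorial's own dataset
ages = pd.Series([34, 28, 45], index=["Ana", "Ben", "Cy"], name="Age")
cities = pd.Series({"Ana": "Lisbon", "Ben": "Oslo", "Cy": "Perth"}, name="City")

# Series that share an index line up as the columns of a DataFrame
people_df = pd.DataFrame({"Age": ages, "City": cities})
print(people_df)

# Selecting one row or one column gives back a Series
print(type(people_df.loc["Ben"]))
print(type(people_df["Age"]))
```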

In [None]:
#INSTALL AND IMPORT PANDAS
!{sys.executable} -m pip install pandas
import pandas as pd

In [None]:
#CREATE SERIES FROM LIST CODE HERE


In [None]:
#CREATE SERIES FROM DICTIONARY CODE HERE


In [None]:
#CREATE DATAFRAME FROM SERIES CODE HERE


#### Run these cells to see more about the characteristics and capabilities of a DataFrame

In [None]:
#individual row of a DataFrame is a Series
print(egeeks_df.loc['Thomas'])
print(type(egeeks_df.loc['Thomas']))

In [None]:
#individual column of a DataFrame is a Series
print(egeeks_df['Ring Size'])
print(type(egeeks_df['Ring Size']))

In [None]:
#add another column directly from list
egeeks_df['City'] = ['Jasper', 'York', 'Lippstadt', 'New York', 'New Albany']
print(egeeks_df)

In [None]:
#add column with single value
egeeks_df['Favorite Language'] = 'ABAP'
print(egeeks_df)

In [None]:
#save our egeeks data to csv
egeeks_df.to_csv('egeeks_data.csv', index_label='Name') #give index a label for csv column

In [None]:
#load our egeeks data from csv
egeeks_df = None
egeeks_df = pd.read_csv('egeeks_data.csv', index_col='Name') #specify which column to use as index
print(egeeks_df)

In [None]:
#dataframes are queryable!
print("egeeks living in Indiana:")
print(egeeks_df.loc[egeeks_df['State'] == 'Indiana'])
print("\negeeks with small fingers:")
print(egeeks_df.loc[egeeks_df['Ring Size'] <= 8])

# Machine Learning

Now that you are familiar with some of the basic tools, it's time to dive into machine learning. In the following steps, you will build a machine learning model that predicts whether a line of text is more likely a quote from a `Star Wars` character or from `Elon Musk`. May the force be with you.

### Step 6: Install Scikit-Learn
For this tutorial, you will be using scikit-learn, a popular machine learning library for Python.

In [None]:
#INSTALL SCIKIT-LEARN
!{sys.executable} -m pip install scikit-learn

### Step 7: Prepare Elon Musk Data

The data files that will be used are already available on this notebook server. To view all the available files, including the `.ipynb` notebook file itself, use the menu option `File->Open`.

The Elon Musk dataset contains tweets and retweets made by `@elonmusk` between November 2012 and September 2017. Source: https://www.kaggle.com/kulgen/elon-musks-tweets

In [None]:
#LOAD MUSK DATA


In [None]:
#PREVIEW MUSK DATA


In [None]:
#CREATE NEW DATAFRAME WITH NO RETWEETS


In [None]:
#CLEAN UP MUSK DATA FOR PROCESSING


In [None]:
#PREVIEW PREPROCESSED MUSK DATA


### Step 8: Prepare Star Wars Data
The data files contain lines of dialogue from the original trilogy. Source: https://www.kaggle.com/xvivancos/star-wars-movie-scripts

In [None]:
#LOAD MOVIE DATA


In [None]:
#CREATE NEW DATAFRAME WITH ALL MOVIE LINES


### Step 9: Combine Datasets

In [None]:
#ADD COLUMN TO BOTH DATAFRAMES TO INDICATE SOURCE


In [None]:
#COMBINE INTO SINGLE DATAFRAME


In [None]:
#PREVIEW PREPROCESSED COMBINED FULL DATASET


### Step 10: Vectorization - Fit
Machines like numbers, not text. In order to perform machine learning on text documents, we first need to turn the text content into numerical feature vectors. There are many different methods and theories for vectorization and some options are better depending on the use case. For this tutorial, we will use TF-IDF (Term Frequency times Inverse Document Frequency).

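TF-IDF weights each term by how often it appears in a line of text (term frequency), scaled down by how many lines in the entire dataset contain it (inverse document frequency). Terms that appear almost everywhere carry little weight, while terms that are distinctive for a line carry more.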
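To make the weighting concrete, here is a small sketch that fits scikit-learn's `TfidfVectorizer` on a hypothetical three-line corpus and recomputes each term's inverse document frequency by hand. With the vectorizer's default settings the smoothed formula is `ln((1 + n) / (1 + df)) + 1`, where `n` is the number of lines and `df` is the number of lines containing the term. The corpus is made up, and the tutorial's own vectorizer settings may differ.

```python
import math

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini corpus standing in for the combined dataset
lines = [
    "the force is strong",
    "the rocket is ready",
    "the force awakens",
]

vectorizer = TfidfVectorizer().fit(lines)

n_lines = len(lines)
for term, column in sorted(vectorizer.vocabulary_.items()):
    # Number of lines that contain the term
    doc_freq = sum(term in line.split() for line in lines)
    # Smoothed IDF used by TfidfVectorizer's defaults
    by_hand = math.log((1 + n_lines) / (1 + doc_freq)) + 1
    print(f"{term:8} df={doc_freq} idf={vectorizer.idf_[column]:.3f} by hand={by_hand:.3f}")
```

The term "the" appears in every line, so it receives the lowest weight of all terms in this corpus.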

In [None]:
#VECTORIZATION FIT CODE HERE


The vectorizer is fit on the whole dataset, which allows new or existing lines of text to be mapped into vectors for processing.
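A rough sketch of that mapping, again on a hypothetical mini corpus rather than the tutorial data: once fitted, the vectorizer turns any line into a row with one column per term it learned, and words it never saw during `fit` are simply dropped.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for the full dataset
lines = ["the force is strong", "the rocket is ready", "the force awakens"]
vectorizer = TfidfVectorizer().fit(lines)

# transform expects an iterable of documents, so a single line is wrapped in a list
vector = vectorizer.transform(["the force is ready for launch"])

print(vector.shape)      # one row, one column per term learned during fit
print(vector.toarray())  # dense view of the sparse result

# "for" and "launch" were not seen during fit, so they have no column
print(vector.nnz)        # number of non-zero entries in the row
```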

In [None]:
#show vector representation of text line


### Step 11: Shuffle and Split Data
It is important to hold back some data to test the model's accuracy.

In [None]:
#SHUFFLE AND SPLIT CODE HERE


Nice! The data is shuffled and split, with 80% for training and 20% for testing.
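An 80/20 shuffle and split can be sketched with scikit-learn's `train_test_split`. The `texts` and `labels` lists below are hypothetical placeholders for the text and source columns of the combined DataFrame, and `random_state=42` is an arbitrary choice that only makes the shuffle repeatable.

```python
from sklearn.model_selection import train_test_split

# Hypothetical lines and labels standing in for the combined DataFrame columns
texts = [f"line {i}" for i in range(10)]
labels = ["Star Wars" if i % 2 == 0 else "Elon Musk" for i in range(10)]

# test_size=0.2 holds back 20% for testing; shuffling is on by default
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42
)

print(len(X_train), len(X_test))  # 8 2
```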

### Step 12: Vectorization - Transform the Training Data
In the previous vectorization step, the vectorizer object was fit with the entire dataset to create a map of terms. Now you will transform each individual line of text from the training data into a vector for fitting the text classifier model.

In [None]:
#TRANSFORM TRAINING DATA CODE 


You can see that the result is a matrix with 224 rows of vectorized data, one for each line in our training data.

### Step 13: Train the Model
Time to get into the fun stuff: training a machine learning model! Let's train a simple scikit-learn Naive Bayes classifier.

Now that the training data has been vectorized, it's time to train/fit the model. The input for the following `fit` method is two array-like objects: one holding the vectorized representation of the text and the other holding the target value, `Star Wars` or `Elon Musk`, for each line. The two must have the same length and be in the same order.
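The shape of those two inputs can be sketched with hand-made numbers. The vectors below are hypothetical stand-ins for TF-IDF output, and `MultinomialNB` is an assumption, since the text only says "Naive Bayes" and scikit-learn ships several variants.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical, hand-made vectors: one row per line of text, one column per term
vectors = np.array([
    [2.0, 0.0, 1.0],
    [1.0, 0.0, 2.0],
    [0.0, 3.0, 0.0],
    [0.0, 2.0, 1.0],
])
# One target per row, in the same order as the rows above
targets = ["Star Wars", "Star Wars", "Elon Musk", "Elon Musk"]

classifier = MultinomialNB()
classifier.fit(vectors, targets)
print(classifier.classes_)  # the target values the model learned

# Inputs of different lengths are rejected
try:
    MultinomialNB().fit(vectors, targets[:3])
except ValueError as error:
    print(f"fit failed: {error}")
```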

In [None]:
#TRAIN CLASSIFIER MODEL


Really, is that it? Yep, your machine learning model has been trained. Amazing, isn't it?

### Step 14: Predict the Test Data
After the model has been trained, it can be used to predict the target values of other data. 

In [None]:
#RUN THIS JUST TO TEMPORARILY IGNORE WARNINGS WHILE WE PREVIEW DATAFRAME SLICES
import warnings
warnings.filterwarnings('ignore')

In [None]:
#PREDICT TEST DATA CODE HERE 


Machine learning magic!

### Step 15: Metrics
To measure accuracy, compare the test predictions with the actual target values and analyze the results.
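Both metrics can be sketched with scikit-learn's `accuracy_score` and `confusion_matrix`. The `actual` and `predicted` lists are invented for illustration and are not results from the tutorial's model.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical actual and predicted targets for six test lines
actual = ["Elon Musk", "Elon Musk", "Elon Musk", "Star Wars", "Star Wars", "Star Wars"]
predicted = ["Elon Musk", "Elon Musk", "Star Wars", "Star Wars", "Star Wars", "Star Wars"]

# Share of predictions that match the actual target (5 of 6 here)
accuracy = accuracy_score(actual, predicted)
print(accuracy)

# Rows are actual targets, columns are predicted targets, in the order of `labels`
matrix = confusion_matrix(actual, predicted, labels=["Elon Musk", "Star Wars"])
print(matrix)
```

Passing `labels` explicitly fixes the row and column order, which makes the matrix easier to read.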

In [None]:
#ACCURACY


In [None]:
#CONFUSION MATRIX


### Step 16: Chart Confusion Matrix
The confusion matrix shows where the predictions agree and disagree with the actual values for both targets. To make it clearer, we can use matplotlib to generate a more user-friendly version of the confusion matrix.

In [None]:
#INSTALL MATPLOTLIB
!{sys.executable} -m pip install matplotlib

In [None]:
#BUILD CHART CODE HERE


The confusion matrix shows:
* Top left: # of predictions where both the actual and predicted values were Elon Musk
* Top right: # of predictions where the actual value was Elon Musk but the predicted value was Star Wars
* Bottom left: # of predictions where the actual value was Star Wars but the predicted value was Elon Musk
* Bottom right: # of predictions where both the actual and predicted values were Star Wars
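One way such a chart might be drawn with plain matplotlib is sketched below. The counts are hypothetical, the layout matches the list above (rows are actual values, columns are predicted values), and the output file name `confusion_matrix.png` is an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical counts: rows are actual targets, columns are predicted targets
labels = ["Elon Musk", "Star Wars"]
matrix = np.array([[2, 1],
                   [0, 3]])

fig, ax = plt.subplots()
ax.imshow(matrix, cmap="Blues")

# Label both axes with the target names
ax.set_xticks(range(len(labels)))
ax.set_xticklabels(labels)
ax.set_yticks(range(len(labels)))
ax.set_yticklabels(labels)
ax.set_xlabel("Predicted")
ax.set_ylabel("Actual")

# Write each count inside its square
for row in range(matrix.shape[0]):
    for col in range(matrix.shape[1]):
        ax.text(col, row, str(matrix[row, col]), ha="center", va="center")

fig.savefig("confusion_matrix.png")  # hypothetical output file name
```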

### Step 17: Build Pipeline
Once the model is built, you will most likely want to use it to classify future lines of dialogue. It's important to preprocess and vectorize any future data the same way the training data was processed before feeding it in for classification. Scikit-learn provides pipelines, objects that store all the steps necessary to process data.
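A minimal pipeline sketch follows. The training lines are hypothetical, the step names `"tfidf"` and `"clf"` are arbitrary, and `MultinomialNB` is assumed as the Naive Bayes variant.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical training lines and their targets, in matching order
train_lines = [
    "use the force",
    "the dark side of the force",
    "the rocket landed on the droneship",
    "new rocket launch tonight",
]
train_targets = ["Star Wars", "Star Wars", "Elon Musk", "Elon Musk"]

# Each step is a (name, object) pair; the step names are arbitrary
text_clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])

# fit runs the vectorizer and then the classifier on the raw text
text_clf.fit(train_lines, train_targets)

# predict accepts raw text; vectorization happens inside the pipeline
print(text_clf.predict(["the dark force"]))
```

Because the vectorizer travels with the classifier, new text is always processed the same way as the training data.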

In [None]:
#without a pipeline, each line of text has to be run through the vectorizer before a prediction
test_line = "may the force be with you"
test_line_tfidf = tfidf_vectorizer.transform([test_line])
sw_elon_clf.predict(test_line_tfidf)

In [None]:
#BUILD PIPELINE CODE HERE


In [None]:
#TEST A PREDICTION


In [None]:
#SAVE PIPELINE OBJECT


### Step 18: Ad Hoc Predictions
Now that the pipeline has been built, use it to run some ad hoc predictions. Have fun with it!
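Saving, reloading, and querying a pipeline could look like the sketch below. It uses Python's built-in `pickle` module, which is only one option. The training data, the file name `text_clf.pkl`, and the choice of `MultinomialNB` are all assumptions made for illustration.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical training data standing in for the tutorial dataset
train_lines = [
    "use the force",
    "the dark side of the force",
    "the rocket landed on the droneship",
    "new rocket launch tonight",
]
train_targets = ["Star Wars", "Star Wars", "Elon Musk", "Elon Musk"]

text_clf = Pipeline([("tfidf", TfidfVectorizer()), ("clf", MultinomialNB())])
text_clf.fit(train_lines, train_targets)

with tempfile.TemporaryDirectory() as folder:
    model_path = Path(folder) / "text_clf.pkl"  # hypothetical file name
    # Save the fitted pipeline, then load it back as a separate object
    model_path.write_bytes(pickle.dumps(text_clf))
    loaded_clf = pickle.loads(model_path.read_bytes())

# The loaded pipeline accepts raw text directly
for line in ["rocket launch delayed", "the dark force"]:
    print(f"{line!r} -> {loaded_clf.predict([line])[0]}")
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.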

In [None]:
#LOAD PIPELINE AND RUN AD HOC PREDICTIONS


There you have it: Elon Musk loves SAP TechEd and Star Wars loves cookies!

### Step 19: BONUS: UI for Fun
As a fun bonus, create a quick UI in Jupyter Notebook that lets you type in some text, hit enter, and see the prediction.

In [None]:
#BONUS FUN UI CODE HERE


Type in some text and hit enter to see who was more likely to say it.

### Step 20: The End - Congrats!
Wow, a fully functional, production-like machine learning toy app! Hopefully you are now equipped with the knowledge you need to begin your own adventure in data science and machine learning.