# AI in Production: Data Science Tools

This notebook accompanies [these slides](https://docs.google.com/presentation/d/1nVuLd79JPxDpQ4wKsGPNIZJcw6Ocm-gAjJaxtNWAcpk/edit#slide=id.p). It was originally prepared for the Artificial Intelligence for Public Health (AI4PH) Summer Institute. The intended audience is graduate public health students.

<img src="https://miro.medium.com/max/1838/1*YPsZO50dIiEKpW9RqzqsTw.jpeg" alt="Data Science" style="height: 300px;"/>

## Environment Set Up

<div style="padding: 10px; border: 2px black solid;">

## <font color='green'>New Tool Alert:</font> Anaconda
    
<img src="https://upload.wikimedia.org/wikipedia/en/thumb/c/cd/Anaconda_Logo.png/440px-Anaconda_Logo.png" alt="Anaconda Logo" style="height: 100px;"/>

**What is it**: Anaconda is a distribution of the Python and R programming languages for scientific computing that aims to simplify package management and deployment. The distribution includes data-science packages suitable for Windows, Linux, and macOS.

**Why use it**: Anaconda lets you easily create and manage multiple environments with isolated dependencies.
 
**Download**: Download from [here](https://www.anaconda.com/products/individual)
    
**Resources**: Conda cheat sheet [here](https://docs.conda.io/projects/conda/en/4.6.0/_downloads/52a95608c49671267e40c689e0bc00ca/conda-cheatsheet.pdf)
    
</div>

### Setting Up Anaconda Environment and Installing Jupyter Notebooks

After installing Anaconda, open your command prompt or terminal window.

Create a new environment
```
conda create --name dstools

```

To see if the command worked and list all your environments run
```
conda env list
```

<img src="./list.png" alt="Image Info" style="height: 300px;"/>

Now activate the environment by running
```
conda activate dstools
```

Finally install jupyter notebooks
```
conda install jupyter notebook
```

Now launch Jupyter Notebook by typing
```
jupyter notebook
```

<div style="padding: 10px; border: 2px black solid;">

## <font color='green'>New Tool Alert:</font> Jupyter Notebook
    
<img src="https://www.dataquest.io/wp-content/uploads/2019/01/1-LPnY8nOLg4S6_TG0DEXwsg-1.png" alt="Jupyter Logo" style="height: 300px;"/>

**What is it**: A Jupyter notebook is a document that supports mixing executable code, equations, visualizations, and narrative text. Specifically, Jupyter notebooks allow the user to bring together data, code, and prose to tell an interactive, computational story. Jupyter Notebook is free and open source and supports over 100 programming languages including Python, Java, R, Julia, Matlab, Octave, Scheme, Processing, Scala, and many more.

**Why use it**: Jupyter is an interactive environment that can be shared with anyone. They can run through the code, collaborate and understand the story by mixing formatted text with the code.

**Download**: Install from [here](https://jupyter.org/install)

**Resources**: Cheat sheet [here](https://www.edureka.co/blog/wp-content/uploads/2018/10/Jupyter_Notebook_CheatSheet_Edureka.pdf)

</div>

Running `jupyter notebook` will open a new tab in your default web browser that should look something like the following screenshot.

![Jupyter](Jupyter.png)

This is the Notebook Dashboard, specifically designed for managing your Jupyter Notebooks. Think of it as the launchpad for exploring, editing and creating your notebooks.

You can create a new notebook by clicking New in the top right-hand corner.

![Markdown](New.png)

This lets you create a new notebook. Currently the only option is Python. However if you have other kernels installed (such as R) you would see them here.

![Notebook](https://www.dataquest.io/wp-content/uploads/2019/01/new-notebook.jpg)

Once you have opened the notebook, the interface shouldn't be too hard to figure out. There are two new terms you have to learn:

* Kernels: the engine that executes the code. It could be Python, R, or another language.
* Cells: containers for text to be displayed in the notebook or code to be executed by the notebook’s kernel.

In each cell, you can either write code to execute or write Markdown (a text formatting system similar to HTML). 

To execute code, you can use the run button in the toolbar or the shortcut `shift + enter`. Go ahead and execute the following code block, which should print out `Hello AI4PH Two`, by using either the run button or the shortcut keys.

In [1]:
print("Hello AI4PH Two")

Hello AI4PH Two


When we run the cell, its output is displayed below and the label to its left changes from `In [ ]` to `In [1]`, indicating that this was the first code cell to be run.

Run the cell again and the label will change to `In [2]`, because the cell is now the second to be run on the kernel. This numbering makes clear the order in which your code was run. While it's good practice to clean up your code so that it runs from the top of the page to the bottom, this requirement is not enforced, which is why the numbering system is so crucial. 

In [2]:
!pip3 install bs4
!pip3 install selenium
!pip3 install pandas
!pip3 install html5lib
!pip3 install matplotlib
!pip3 install scikit-learn
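
After the installs finish, it can be useful to confirm that each package is importable from the kernel the notebook is actually using. The following is a minimal sketch using only the standard library; the helper name `missing_modules` is made up for this example, and the list uses import names (scikit-learn, for instance, is imported as `sklearn`).

```python
from importlib.util import find_spec


def missing_modules(module_names):
    """Return the names that cannot be imported in the current environment."""
    return [name for name in module_names if find_spec(name) is None]


# Import names used in this notebook (not necessarily the pip package names).
required = ["bs4", "selenium", "pandas", "html5lib", "matplotlib", "sklearn"]
print("Missing:", missing_modules(required))
```

An empty list means everything is available. Any name that is printed points to a package that was probably installed into a different environment than the one the kernel runs in.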



In [3]:
import pandas as pd

If you have Git installed on your local computer, you can get the code by cloning it from GitHub by running ```git clone https://github.com/farbodab/ds-tools-ai4ph.git```

<div style="padding: 10px; border: 2px black solid;">

## <font color='green'>New Tool Alert:</font> Git
    
<img src="https://miro.medium.com/max/910/1*Wjxx83j-qyiNvFBy1yOA1w.jpeg" alt="Git Logo" style="height: 300px;"/>

**What is it**: Git is software that helps you keep track of changes to files in a folder on your PC. After making some changes to files in this folder, you can “commit” the changes to Git for safe-keeping. These changes could be creating, renaming, or deleting a file or subfolder, or editing the content of a file.

**Why use it**: Git lets you keep track of changes over time and revert back when needed.

**Download**: Install from [here](https://git-scm.com/downloads)

**Resources**: Cheat sheet [here](https://i.redd.it/8341g68g1v7y.png)

</div>

<div style="padding: 10px; border: 2px black solid;">

## <font color='green'>New Tool Alert:</font> GitHub
    
<img src="https://github.githubassets.com/images/modules/logos_page/GitHub-Mark.png" alt="Github Logo" style="height: 300px;"/>

**What is it**: GitHub lets you keep a copy of your local Git repositories in the cloud.

**Why use it**: It lets you share and collaborate with others via Git's version control.

**Download**: Create an account [here](https://github.com/)

</div>

## Business Understanding

After a number of back-and-forth meetings with our bosses, we have a fairly good understanding of what they are trying to do.

**Problem**: 

**Business Requirement**: 
* The health system is looking to use data to plan for capacity and for response planning
* Stakeholders are looking for a weekly email containing the 7-day prediction of COVID-19 cases.

## Analytics Approach

The problem is a predictive one. Since the goal is to predict the number of cases, the task at hand is to build, test, and implement a regression model.

## Data Requirements

As a starting point, we need data on daily COVID-19 cases. We already have a number of different data elements that may improve the predictive power of the proposed model (location of cases, socioeconomic factors tied to geography, age of cases, etc.). The process is entirely iterative. 

## Data Collection

### Web Scraping

In the beginning of the pandemic, COVID-19 data was sparse. The only source of such data was the Ontario Government's website, shown below.
![status](./status.png)

To extract data from the website, we need to scrape the webpage, pulling the data out of the underlying code that renders it. While data sources have matured over time, web scraping is a fairly unique and extremely useful skill to have in your toolbelt when doing public health on the go. 
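
To make "the underlying code that renders the website" concrete, here is a small sketch that pulls the rows out of an HTML table using only Python's standard library. The page source and the `TableParser` class are made up for illustration; they are not the Ontario page or the method used later in this notebook.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            if not self.rows:  # tolerate a cell that appears before any <tr>
                self.rows.append([])
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


# Made-up page source, standing in for what a browser would receive.
page_source = """
<table>
  <tr><th>Status</th><th>Number</th></tr>
  <tr><td>Confirmed cases</td><td>120</td></tr>
  <tr><td>Resolved</td><td>95</td></tr>
</table>
"""

parser = TableParser()
parser.feed(page_source)
print(parser.rows)
```

Each inner list is one table row. Real pages are messier and often build their tables with JavaScript after loading, which is one reason a browser automation tool can be needed instead of a plain parser.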

In [4]:
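from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

# Selenium 4 style: the driver path is passed through a Service object,
# and elements are located with find_element(By.<strategy>, value).
driver = webdriver.Chrome(service=Service('./chromedriver'))
driver.get("http://www.python.org")
elem = driver.find_element(By.NAME, "q")  # the site's search box
elem.clear()
elem.send_keys("pycon")
elem.send_keys(Keys.RETURN)
driver.close()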

SessionNotCreatedException: Message: session not created: This version of ChromeDriver only supports Chrome version 91
Current browser version is 99.0.4844.83 with binary path /Applications/Google Chrome.app/Contents/MacOS/Google Chrome


<div style="padding: 10px; border: 2px black solid;">

## <font color='green'>New Tool Alert:</font> Selenium
<img src="https://camo.githubusercontent.com/74ed64243ba05754329bc527cd4240ebd1c087a1/68747470733a2f2f73656c656e69756d2e6465762f696d616765732f73656c656e69756d5f6c6f676f5f7371756172655f677265656e2e706e67" alt="Selenium" style="width: 200px;"/> 
    
**What is it**: It's a tool that lets you automate the operations of your browser, which makes it useful for scraping data from the web.

**Why use it**: A lot of information remains locked in websites and is only updated there without being available in an easy-to-use format. Selenium lets us get access to that info and turn it into usable data.

**Download**: Follow instructions [here](https://selenium-python.readthedocs.io/) to set it up.

**Resources**: Cheat sheet [here](https://ivantay2003.medium.com/selenium-cheat-sheet-in-python-87221ee06c83)
    
**Notes**: Users on macOS will need to do an additional step to be able to use the driver downloaded above, which is to mark the file as safe by using the command ```xattr -d com.apple.quarantine name-of-file ```
    
</div>

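If Selenium is set up correctly, this should launch your browser, open Python's website, and search for the word pycon. If you instead see a `SessionNotCreatedException` like the one in the output above, your ChromeDriver version does not match your installed Chrome version, and you will need to download a matching driver.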

Now that we know things are working, let's go to our real example.

In [None]:
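from selenium import webdriver
from selenium.webdriver.chrome.service import Service

url = "https://www.ontario.ca/page/how-ontario-is-responding-covid-19"
driver = webdriver.Chrome(service=Service('./chromedriver'))  # Selenium 4 style
driver.get(url)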

This should open a new Chrome window and navigate to our data source. Now we use the XPath of the table of interest to get the specific element on the page, and we save its HTML content to a variable.

In [None]:
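from selenium.webdriver.common.by import By

# Selenium 4 style: find_element(By.XPATH, ...) replaces find_element_by_xpath(...)
element = driver.find_element(By.XPATH, '//*[@id="pagebody"]/table[1]')
element_html = element.get_attribute('outerHTML')
driver.close()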
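
The XPath expression above selects the first `<table>` directly inside whichever element has `id="pagebody"`. To see how such a path picks out an element without opening a browser, here is a sketch using the standard library's `xml.etree.ElementTree`, which supports a limited subset of XPath. The page content is made up, and this parser only accepts well-formed markup, which real web pages often are not.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed page standing in for the real one.
page = (
    '<html><body><div id="pagebody">'
    "<p>Intro text</p>"
    "<table><tr><th>Status</th><th>Number</th></tr>"
    "<tr><td>Confirmed cases</td><td>120</td></tr></table>"
    "<table><tr><td>Some other table</td></tr></table>"
    "</div></body></html>"
)

root = ET.fromstring(page)

# ElementTree paths are relative, so the expression starts with "." here.
table = root.find(".//*[@id='pagebody']/table[1]")
table_html = ET.tostring(table, encoding="unicode")
print(table_html)
```

The second table is ignored because `table[1]` asks for the first match only (XPath positions start at 1, not 0).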

The next library we will be using today is called pandas. Pandas is the go-to data manipulation and wrangling library for Python.

In [None]:
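from io import StringIO
import pandas as pd

# Wrap the HTML string in StringIO; newer pandas versions deprecate passing literal HTML.
df = pd.read_html(StringIO(element_html))[0]
df.head()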

<div style="padding: 10px; border: 2px black solid;">

## <font color='green'>New Tool Alert:</font> Pandas

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/e/ed/Pandas_logo.svg/1200px-Pandas_logo.svg.png" alt="Selenium" style="height: 200px;"/>

**What is it**: Pandas is a Python library for data manipulation and analysis. It is the de facto library for data manipulation in Python.

**Why use it**: It's extremely powerful. It lets you read in a variety of data formats and data sources, and aggregate, transform, and combine data. 

**Download**: Install it by using pip ```pip install pandas```

**Resources**: Cheat sheet [here](https://ainfographics.files.wordpress.com/2017/10/python-pandas-cheat-sheet.png)
    
</div>

Now that we have our table of interest we can just save it to file, so we can start to keep track of cases daily. 

In [None]:
df.to_csv('cases-july-17-2021.csv',index=False)
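
Since the goal is to save a new file every day, the date in the file name does not have to be typed by hand. Below is a small sketch that builds a name in the same style from a date; `daily_filename` is a hypothetical helper, and it assumes the default (English) locale for month names.

```python
from datetime import date


def daily_filename(day):
    """Build a name like 'cases-july-17-2021.csv' for the given date."""
    return day.strftime("cases-%B-%d-%Y.csv").lower()


print(daily_filename(date(2021, 7, 17)))
print(daily_filename(date.today()))
```

It could then be used as `df.to_csv(daily_filename(date.today()), index=False)`. Note that `%d` zero-pads single-digit days, so the 5th of a month appears as `05`.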

### Official Data Source

The main dataset we will be using is from [here](https://data.ontario.ca/en/dataset/confirmed-positive-cases-of-covid-19-in-ontario).

In [None]:
url = "https://data.ontario.ca/dataset/f4112442-bdc8-45d2-be3c-12efae72fb27/resource/455fd63b-603d-4608-8216-7d8647f43350/download/conposcovidloc.csv"
all_cases = pd.read_csv(url)

In [None]:
all_cases.head()

## Data Understanding

In [None]:
all_cases.head()

In [None]:
all_cases.info()

In [None]:
all_cases['Age_Group'].value_counts()

In [None]:
all_cases['Client_Gender'].value_counts()

In [None]:
all_cases['Outcome1'].value_counts()

In [None]:
all_cases['Outbreak_Related'].value_counts()

## Data Preparation

In [None]:
agg = all_cases.groupby(['Accurate_Episode_Date'])['Row_ID'].count().to_frame().reset_index()
agg['Accurate_Episode_Date'] = pd.to_datetime(agg['Accurate_Episode_Date'])
agg.tail()
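
The `groupby(...).count()` step collapses one row per case into one row per date. The same idea in plain Python, on a handful of made-up episode dates, looks roughly like this:

```python
from collections import Counter

# Made-up episode dates, one entry per reported case.
episode_dates = [
    "2021-07-01", "2021-07-01",
    "2021-07-02",
    "2021-07-04", "2021-07-04", "2021-07-04",
]

cases_per_day = Counter(episode_dates)
for day in sorted(cases_per_day):
    print(day, cases_per_day[day])
```

Notice that `2021-07-03` does not appear at all: a date with no cases produces no row rather than a row with zero. The same is true of the pandas aggregation, which is worth keeping in mind when treating consecutive rows as consecutive days.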

Next, let's visualize the data to make sure our aggregation worked as expected.

In [None]:
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import matplotlib.cbook as cbook

fig = plt.figure()
ax = plt.axes()

x = agg['Accurate_Episode_Date']
y = agg['Row_ID']
ax.plot(x, y)

# Major ticks every 6 months.
fmt_half_year = mdates.MonthLocator(interval=6)
ax.xaxis.set_major_locator(fmt_half_year)

# Minor ticks every month.
fmt_month = mdates.MonthLocator()
ax.xaxis.set_minor_locator(fmt_month)

# Text in the x axis will be displayed in 'YYYY-mm' format.
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m'))

plt.title("Number of cases over time")

<div style="padding: 10px; border: 2px black solid;">

## <font color='green'>New Tool Alert:</font> Matplotlib

<img src="https://matplotlib.org/_static/logo2_compressed.svg" alt="Matplotlib" style="height: 200px;"/>

**What is it**: Matplotlib is a Python library for plotting and data visualization. 

**Why use it**: It's free and open source and extremely customizable. 

**Download**: Install it by using pip ```pip install matplotlib```

**Resources**: Cheat sheet [here](https://raw.githubusercontent.com/matplotlib/cheatsheets/master/cheatsheets-1.png)
    
</div>

We have a time series problem, so let's get the data in shape for it. First, let's drop any data from 2019, as that tail is not going to be as relevant for predicting the most recent data. 

In [None]:
agg.head()

In [None]:
agg.tail()

In [None]:
agg = agg.drop(agg.loc[agg.Accurate_Episode_Date < '2020-01-01'].index)

In [None]:
agg.head()

In [None]:
agg['t-1'] = agg['Row_ID'].shift(1)
agg['t-2'] = agg['Row_ID'].shift(2)
agg['t-3'] = agg['Row_ID'].shift(3)
agg['t-4'] = agg['Row_ID'].shift(4)
agg['t-5'] = agg['Row_ID'].shift(5)
agg['t-6'] = agg['Row_ID'].shift(6)
agg['t-7'] = agg['Row_ID'].shift(7)
agg['t'] = agg['Row_ID']

In [None]:
agg.head(14)
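
Each `shift(k)` column holds the value from `k` rows earlier, so every row ends up pairing a day's count with the seven counts before it. Here is a sketch of the same construction on a plain list of made-up counts; `make_lag_rows` is a hypothetical helper written for this example.

```python
def make_lag_rows(series, n_lags=7):
    """Pair each value with the n_lags values before it, most recent first."""
    rows = []
    for i in range(n_lags, len(series)):
        lags = [series[i - k] for k in range(1, n_lags + 1)]  # t-1, t-2, ..., t-n
        rows.append((lags, series[i]))
    return rows


daily_cases = [5, 8, 13, 21, 34, 55, 89, 144, 233]  # made-up counts
for lags, target in make_lag_rows(daily_cases):
    print(lags, "->", target)
```

Nine observations produce only two usable rows, because the first seven days do not have a full week of history behind them. In the DataFrame those incomplete rows show up as `NaN` values in the lag columns.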

In [None]:
agg.dropna(how='any',inplace=True)

In [None]:
agg.head()

In [None]:
agg.tail()

In [None]:
x = agg[['t-1','t-2','t-3','t-4','t-5','t-6','t-7']]
y = agg['t']

# Hold out the most recent 30% of days for testing and train on the earlier 70%.
test_size = 0.3
dataset_size = len(agg)
split_index = dataset_size - int(test_size * dataset_size)

x_train, x_test, y_train, y_test = x[:split_index], x[split_index:], y[:split_index], y[split_index:]
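
For time series, the split is usually chronological rather than random: the earliest observations are used for training and the most recent ones for testing, so the model is never evaluated on days that precede its training data. A sketch of that idea as a reusable helper, assuming `test_size` means the fraction held out for testing (`chronological_split` is a name made up for this example):

```python
def chronological_split(rows, test_size=0.3):
    """Split ordered rows so the last `test_size` fraction becomes the test set."""
    n_test = round(len(rows) * test_size)
    split_index = len(rows) - n_test
    return rows[:split_index], rows[split_index:]


train, test = chronological_split(list(range(10)), test_size=0.3)
print("train:", train)
print("test: ", test)
```

With ten ordered items and `test_size=0.3`, the first seven go to training and the last three to testing.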

In [None]:
x.shape

In [None]:
y.shape

In [None]:
x_train.head()

In [None]:
y_train.head()

## Modelling and Evaluation

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error


regr = RandomForestRegressor()
regr.fit(x_train,y_train)
y_pred = regr.predict(x_test)
mean_absolute_error(y_test,y_pred)
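
Mean absolute error is the average size of the gap between what was predicted and what actually happened, in the same units as the target (here, cases per day). The sketch below computes it by hand on made-up numbers and compares it with a naive "tomorrow equals today" baseline, which is a common sanity check for forecasting models. All values and helper names are illustrative.

```python
def mean_absolute_error_manual(y_true, y_pred):
    """Average of |actual - predicted| over all observations."""
    return sum(abs(a - p) for a, p in zip(y_true, y_pred)) / len(y_true)


actual = [100, 120, 90, 110]      # made-up daily case counts
predicted = [110, 115, 100, 100]  # made-up model output

model_mae = mean_absolute_error_manual(actual, predicted)

# Naive baseline: predict each day with the previous day's actual value.
baseline_mae = mean_absolute_error_manual(actual[1:], actual[:-1])

print("model MAE:   ", model_mae)
print("baseline MAE:", baseline_mae)
```

In this notebook, the equivalent baseline would be `mean_absolute_error(y_test, x_test['t-1'])`, since the `t-1` column already holds the previous day's count. A model is only adding value if it beats that number.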

In [None]:
from sklearn.linear_model import LinearRegression

regr = LinearRegression()
regr.fit(x_train,y_train)
y_pred = regr.predict(x_test)
mean_absolute_error(y_test,y_pred)

In [None]:
from sklearn.linear_model import Lasso

regr = Lasso()
regr.fit(x_train,y_train)
y_pred = regr.predict(x_test)
mean_absolute_error(y_test,y_pred)

In [None]:
from sklearn.linear_model import Ridge

regr = Ridge()
regr.fit(x_train,y_train)
y_pred = regr.predict(x_test)
mean_absolute_error(y_test,y_pred)

In [None]:
from sklearn.svm import SVR

regr = SVR(kernel='rbf')
regr.fit(x_train,y_train)
y_pred = regr.predict(x_test)
mean_absolute_error(y_test,y_pred)
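
Each of the models above predicts a single day from the seven days before it, while the stakeholders asked for a 7-day prediction. One common way to bridge that gap is recursive forecasting: predict one day, append the prediction to the history, and repeat. The sketch below shows the mechanics with a stand-in model (the average of the lags); `forecast_recursive` and `average_of_lags` are hypothetical names, and this is not necessarily the approach used in the accompanying deployment code.

```python
def forecast_recursive(predict_next, recent_values, horizon=7):
    """Roll a one-step model forward `horizon` days.

    predict_next: callable taking lags ordered [t-1, t-2, ...] and returning a number.
    recent_values: observed values in chronological order (oldest first).
    """
    n_lags = len(recent_values)
    history = list(recent_values)
    forecasts = []
    for _ in range(horizon):
        lags = history[-n_lags:][::-1]  # most recent first, like t-1 ... t-7
        next_value = predict_next(lags)
        forecasts.append(next_value)
        history.append(next_value)      # feed the prediction back in
    return forecasts


def average_of_lags(lags):
    """Stand-in model for illustration only."""
    return sum(lags) / len(lags)


last_week = [10, 10, 10, 10, 10, 10, 17]  # made-up counts, oldest first
print(forecast_recursive(average_of_lags, last_week))
```

With a fitted scikit-learn model, `predict_next` could be something like `lambda lags: regr.predict(pd.DataFrame([lags], columns=x.columns))[0]`. Because later days are predicted from earlier predictions, errors tend to accumulate toward the end of the horizon.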

<div style="padding: 10px; border: 2px black solid;">

## <font color='green'>New Tool Alert:</font> Scikit-Learn

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/0/05/Scikit_learn_logo_small.svg/1200px-Scikit_learn_logo_small.svg.png" alt="Sklearn" style="height: 200px;"/>

**What is it**: Scikit-Learn is a Python library for machine learning. 

**Why use it**: It has various classification, regression, and clustering algorithms and is designed to run efficiently. 

**Download**: Install it by using pip ```pip install scikit-learn```

**Resources**: Library cheat sheet [here](https://d1jnx9ba8s6j9r.cloudfront.net/blog/wp-content/uploads/2018/09/Python-Cheat-Sheet-for-Scikit-learn-Edureka.pdf?utm_source=blog&utm_medium=cheat-sheet&utm_campaign=Scikit-Cheat-Sheet-28-09-2018-AH). Model selection cheat sheet [here](https://scikit-learn.org/stable/_static/ml_map.png)
    
</div>

## Deployment

Now that you have a model, you have to figure out how you want to present the results and deploy it into production. This could be a visualization in Tableau or Power BI, an email to your boss that gets sent every day, numbers displayed live on a site, and much more. 
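
As one illustration of the email option, the sketch below composes a plain-text message from a list of forecast values using Python's standard library. It only builds the message and does not send it; the forecast numbers, addresses, and the `build_forecast_email` helper are all placeholders.

```python
from email.message import EmailMessage


def build_forecast_email(forecast, sender, recipient):
    """Compose (but do not send) a plain-text email listing the forecast."""
    lines = [f"Day {i}: {value:.0f} cases" for i, value in enumerate(forecast, start=1)]
    message = EmailMessage()
    message["Subject"] = "Weekly COVID-19 case forecast"
    message["From"] = sender
    message["To"] = recipient
    message.set_content("\n".join(lines))
    return message


# Made-up forecast and placeholder addresses.
email = build_forecast_email([120, 125, 131], "model@example.org", "team@example.org")
print(email)
```

Actually sending the message would additionally require a mail server and credentials (for example via the standard library's `smtplib`), which depend on your organization's setup.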

Depending on the use case, it may make sense to deploy the model into the cloud and let it run there automatically. Below I will introduce a few more tools for you to be aware of.

<div style="padding: 10px; border: 2px black solid;">

## <font color='green'>New Tool Alert:</font> Docker

<img src="https://logos-world.net/wp-content/uploads/2021/02/Docker-Symbol.png" alt="Sklearn" style="height: 200px;"/>

**What is it**: Containerization tool mainly used for shipping and running applications quickly across different platforms

**Why use it**:  Four main advantages:
* Isolation: It helps us create an environment agnostic system. Your application runs smoothly on different platforms. This is essentially achieved using containers. 
* Portability: Since all of your dependencies are in the same container, it’s easy to carry from one place to another giving Docker its portability.
* Lightweight: Runs as just another application on your system instead of consuming a large share of its resources.
* Robustness: Less demanding in terms of hardware and needs very little memory as compared to VMs, hence providing efficient isolation levels which help save not only the cost but also time.

**Download**: Download it from [here](https://docs.docker.com/get-docker/)

**Resources**: Cheat sheet [here](https://www.docker.com/sites/default/files/d8/2019-09/docker-cheat-sheet.pdf)
    
</div>

In the deploy folder, key aspects of the code have been moved over to a ```deploy.py``` file. There is also a new file called ```Dockerfile``` that specifies the instructions for running the deploy file. If you have Docker installed locally, you can try to build the image using the ```docker build -t dstools:1.0 .``` command and then run it using ```docker container run --name myimage dstools:1.0```, where `--name` sets the container's name and `dstools:1.0` is the image built in the previous step. 

This Docker image can now be run in the cloud, on someone else's computer, or anywhere else that has Docker installed. If you want to deploy your own model, you can use a cloud infrastructure provider such as AWS, GCP, or Azure, or a platform-as-a-service provider such as Heroku.