# Essential Machine Learning and Exploratory Data Analysis with Python and Jupyter Notebook





## Pragmatic AI Labs
![alt text](https://paiml.com/images/logo_with_slogan_white_background.png)

This notebook was produced by [Pragmatic AI Labs](https://paiml.com/).  You can continue learning about these topics by:

*   Buying a copy of [Pragmatic AI: An Introduction to Cloud-Based Machine Learning](http://www.informit.com/store/pragmatic-ai-an-introduction-to-cloud-based-machine-9780134863917)
*   Reading an online copy of [Pragmatic AI: An Introduction to Cloud-Based Machine Learning](https://www.safaribooksonline.com/library/view/pragmatic-ai-an/9780134863924/)
*   Viewing more content at [noahgift.com](https://noahgift.com/)


# Part 2.1: IO Operations in Python and Pandas and ML Project Exploration

## Working with Files

### Writing to a file

In [1]:
f = open('workfile.txt', 'w')
f.write("foo")
f.close()
!cat workfile.txt
!ls -l

foototal 52736
-rw-r--r--  1 noahgift  staff     99050 Aug 10 06:04 Public_Master_SafariOnline_Day1_Part1.ipynb
-rw-r--r--@ 1 noahgift  staff     67812 Aug  7 09:04 Public_SafariOnline_Day1_Part2.ipynb
-rw-r--r--  1 noahgift  staff  26806876 Aug 10 05:28 Public_SafariOnline_Day2_Part1.ipynb
-rw-r--r--@ 1 noahgift  staff     15488 Aug  7 09:04 Public_SafariOnline_Day2_Part2.ipynb
-rw-r--r--  1 noahgift  staff         3 Aug 10 06:05 workfile.txt


### Writing to a file with a context manager

In [2]:
with open("workfile.txt", "w") as workfile:
    workfile.write("bam")
!cat workfile.txt

bam

### Reading a file in


In [3]:
f = open("workfile.txt", "r")
out = f.readlines()
f.close()
print(out)

['bam']


### Reading a file with a context manager

In [4]:
with open("workfile.txt", "r") as workfile:
    #print(workfile.readlines())
    print(workfile.read())


bam


## Serialization Techniques


### Serialize a Python Dictionary to Pickle

In [5]:
mydict = {"one":1, "two":2}

In [6]:
import pickle

In [7]:
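# The context manager closes the file, which flushes the pickle to disk
with open('mydictionary.pickle', 'wb') as picklefile:
    pickle.dump(mydict, picklefile)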

In [8]:
!ls -l mydictionary.pickle

-rw-r--r--  1 noahgift  staff  32 Aug 10 06:05 mydictionary.pickle


In [9]:
!cat mydictionary.pickle

�}q (X   oneqKX   twoqKu.

In [10]:
with open('mydictionary.pickle', 'rb') as picklefile:
    res = pickle.load(picklefile)

In [11]:
print(res)

{'one': 1, 'two': 2}


### Serialize a Python Dictionary to JSON


In [12]:
import json
with open('data.json', 'w') as outfile:
    json.dump(res, outfile)

In [13]:
!cat data.json

{"one": 1, "two": 2}

In [14]:
with open('data.json', 'r') as infile:
    res2 = json.load(infile)

In [15]:
print(res2)
type(res2)

{'one': 1, 'two': 2}


dict

### Save to YAML

In [16]:
import yaml

ModuleNotFoundError: No module named 'yaml'

In [None]:
with open("data.yaml", "w") as yamlfile:
    yaml.safe_dump(res2, yamlfile, default_flow_style=False)

In [None]:
!cat data.yaml

### Load YAML

In [None]:
with open("data.yaml", "rb") as yamlfile:
    res3 = yaml.safe_load(yamlfile)

In [None]:
print(res3)
type(res3)

## Use Pandas DataFrames

#### Creating Pandas DataFrames

##### Creating DataFrames from a CSV file



*   Can be local
*   Can be hosted on a website



In [None]:
import pandas as pd
mma_df = pd.read_csv('../data/ufc_fights_all.csv')
mma_df.head()

##### List to Pandas DataFrame 

Convert a list to a Pandas DataFrame

In [None]:
winning_technique_list = mma_df['method_d'].tolist()
techniques_df = pd.DataFrame(winning_technique_list)
techniques_df.columns = ["stoppage"]
techniques_df.head()

##### Dictionary to Pandas DataFrame

Take a dictionary full of dictionaries and make it a Pandas DataFrame
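
The notebook gives no code for this step, so here is a minimal sketch of one way to do it. The `fighters` dictionary and its keys are made up for illustration and are not part of the UFC data set; `pd.DataFrame.from_dict` with `orient="index"` is one reasonable choice when each outer key should become a row.

```python
import pandas as pd

# Hypothetical nested dictionary: each outer key maps to a dictionary of fields
fighters = {
    "fighter_a": {"wins": 10, "losses": 2},
    "fighter_b": {"wins": 7, "losses": 5},
}

# orient="index" turns each outer key into a row label;
# the default orientation would turn each outer key into a column instead
fighters_df = pd.DataFrame.from_dict(fighters, orient="index")
print(fighters_df)
```

Passing the same dictionary straight to `pd.DataFrame(fighters)` gives the transposed layout, with the outer keys as columns.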

#### Exporting Pandas DataFrames

##### Pandas DataFrame Column to List

Use the `tolist` method to convert a column to a Python list

In [None]:
winning_technique_list = mma_df['method_d'].tolist()
winning_technique_list[0:4]

##### Pandas DataFrame to Dictionary

Grab a couple of records and convert them to a Python dictionary

In [None]:
mma_dict = mma_df.head(2).to_dict()
mma_dict

##### Pandas DataFrame to CSV

Write out a DataFrame using `to_csv`


In [None]:
mma_df.head().to_csv("small_mma_records.csv")
!cat small_mma_records.csv

#### Using Pandas on Ray

More info on Pandas on Ray:  https://rise.cs.berkeley.edu/blog/pandas-on-ray/

#### Using Google Sheets with Pandas DataFrames

Reference:  [Official Google Colab Documentation on IO](https://colab.research.google.com/notebooks/io.ipynb)

**Install Google Spreadsheet Library**

In [None]:
#!pip install --upgrade -q gspread

**Authenticate to API**

In [None]:
#from google.colab import auth
#auth.authenticate_user()

#import gspread
#from oauth2client.client import GoogleCredentials

#gc = gspread.authorize(GoogleCredentials.get_application_default())

**Create a Spreadsheet and Put Items in It**

Note: an existing spreadsheet could be used instead.

In [None]:
#sh = gc.create('pramaticai-test')
#worksheet = gc.open('pramaticai-test').sheet1
#cell_list = worksheet.range('A1:A10')

#import random
#count = 0
#for cell in cell_list:
#  count +=1
#  cell.value = count
#worksheet.update_cells(cell_list)

**Convert Spreadsheet Data to Pandas DataFrame**

In [None]:
#worksheet = gc.open('pramaticai-test').sheet1
#rows = worksheet.get_all_values()
#import pandas as pd
#df = pd.DataFrame.from_records(rows)
#df.head()

## Concurrency in Python

#### Threads

Threads are the beat-up Pinto of concurrency in Python.  In CPython, the Global Interpreter Lock (GIL) lets only one thread execute Python bytecode at a time, so threads do not scale across multiple cores and often cause performance problems.  In most cases, another method of concurrency in Python is the better choice.

*Typically they are used in situations where things are IO bound, not CPU bound.*

![Pinto](https://homeprohub.files.wordpress.com/2013/03/cost-of-window-replacement.jpg)





##### Simple Threading Example

In [None]:
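import threading

def fight_club(num, x):
    # num is passed in explicitly; reading the global loop variable
    # from inside a running thread can report the wrong thread number
    print(f"Processing Thread# {num}: Calculating punch with attack strength {x} to the {x} power\n")
    return x**x

workers = []
for num in range(1, 6):
    print(f"Queuing thread # {num}\n")
    thread = threading.Thread(target=fight_club, args=(num, num))
    workers.append(thread)
    thread.start()

# Wait for every thread to finish before moving on
for thread in workers:
    thread.join()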
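
For the IO-bound case mentioned above, a thread pool from the standard library is often easier to manage than raw `Thread` objects. The sketch below is illustrative only: `fake_download` is a hypothetical stand-in that sleeps instead of making a real network or disk call.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_download(item_id):
    """Hypothetical stand-in for an IO call; sleeping releases the GIL."""
    time.sleep(0.2)
    return item_id * 2

start = time.time()
with ThreadPoolExecutor(max_workers=5) as executor:
    # executor.map returns results in the same order as the inputs
    download_results = list(executor.map(fake_download, range(5)))
elapsed = time.time() - start

print(download_results)
print(f"Elapsed: {elapsed:.2f}s")
```

Because the five simulated calls wait at the same time, the elapsed time should be close to a single sleep rather than the sum of all five.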

#### Using the subprocess module

A general-purpose way to "shell out" to system commands


In [None]:
import subprocess
res = subprocess.Popen("ls -l", shell=True, stdout=subprocess.PIPE)
out = res.stdout.readlines()
print(out)
!ls -l
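
As an alternative sketch, `subprocess.run` (Python 3.5+) wraps the start, wait, and output collection in one call. The child command here is the current Python interpreter rather than `ls`, purely so the example does not depend on which shell commands are installed.

```python
import subprocess
import sys

# Passing the command as a list avoids shell=True and its quoting pitfalls
completed = subprocess.run(
    [sys.executable, "-c", "print('armbar')"],
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
    universal_newlines=True,  # decode bytes to str
    check=True,               # raise CalledProcessError on a non-zero exit code
)
print(completed.stdout.strip())
```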

In [None]:
!ls -l

### Multiprocessing

#### Mapping processes to Functions

Worker processes are separate interpreters, so they run truly in parallel across cores (unlike threads)

In [None]:
from multiprocessing import Pool
import datetime
import time
import random

def fight_club(x):
  
    sleep_time = random.randrange(0,3)
    time.sleep(sleep_time)
    timestamp = datetime.datetime.now()
    print(f"Calculating punch with attack strength {x} to the {x} power: @timestamp {timestamp} with sleep {sleep_time}")
    return x**x

if __name__ == '__main__':
    # The context manager shuts the worker pool down when the block exits
    with Pool(5) as p:
        print(p.map(fight_club, [1, 2, 3]))

#### Process Pool Joined on Queue (Threadlike behavior)

Mimics the threading interface, but with actual multi-core functionality

In [None]:
from multiprocessing import Process, Queue

def f(q):
    q.put(["armbar", "kimura",  "Mata Leão"])

if __name__ == '__main__':
    q = Queue()
    p = Process(target=f, args=(q,))
    p.start()
    print(f"Grabbing some attacks: {q.get()}")    
    p.join()

### Async IO in Python Examples

More info here:  https://docs.python.org/3/library/asyncio.html

**Using Python3 Async**

```python
import asyncio
import time

# firehose_client, put_record, gen_uuid_events, and LOG are not defined in this
# snippet; they are assumed to come from the surrounding project.

def send_async_firehose_events(count=100):
    """Async sends events to firehose"""

    start = time.time()
    client = firehose_client()
    extra_msg = {"aws_service": "firehose"}
    loop = asyncio.get_event_loop()
    tasks = []
    LOG.info(f"sending async events TOTAL {count}", extra=extra_msg)
    num = 0
    for _ in range(count):
        tasks.append(asyncio.ensure_future(put_record(gen_uuid_events(), client)))
        LOG.info(f"sending async events: COUNT {num}/{count}")
        num += 1
    loop.run_until_complete(asyncio.wait(tasks))
    loop.close()
    end = time.time()
    LOG.info("Total time: {}".format(end - start))
```
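
The snippet above depends on project helpers that are not shown. A self-contained sketch of the same fan-out pattern, assuming Python 3.7+ and using a hypothetical `put_record_stub` coroutine that sleeps instead of calling AWS, might look like this:

```python
import asyncio
import time

async def put_record_stub(event_id):
    """Hypothetical stand-in for an async network call."""
    await asyncio.sleep(0.1)
    return f"sent-{event_id}"

async def send_events(count=10):
    # gather runs the coroutines concurrently and keeps results in input order
    return await asyncio.gather(*(put_record_stub(i) for i in range(count)))

start = time.time()
sent = asyncio.run(send_events())
print(sent[:3], f"{time.time() - start:.2f}s")
```

Inside a Jupyter cell, where an event loop is already running, `await send_events()` would take the place of `asyncio.run(send_events())`.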

**Using trollius library with Python 2:  DEPRECATED**

```python
"""Generates an Async MetaData call.  Note, this isn't available in Boto3
In [56]: res = all_metadata_async()
In [57]: res
Out[57]: 
[('ami-manifest-path', <Response [200]>),
 ('instance-type', <Response [200]>),
 ('instance-id', <Response [200]>),
 ('iam', <Response [200]>),
 ('local-hostname', <Response [200]>),
 ('network', <Response [200]>),
 ('hostname', <Response [200]>),
 ('ami-id', <Response [200]>),
 ('instance-action', <Response [200]>),
 ('profile', <Response [200]>),
 ('reservation-id', <Response [200]>),
 ('security-groups', <Response [200]>),
 ('metrics', <Response [200]>),
 ('mac', <Response [200]>),
 ('public-ipv4', <Response [200]>),
 ('services', <Response [200]>),
 ('local-ipv4', <Response [200]>),
 ('placement', <Response [200]>),
 ('ami-launch-index', <Response [200]>),
 ('public-hostname', <Response [200]>),
 ('public-keys', <Response [200]>),
 ('block-device-mapping', <Response [200]>)]
"""

import os

import requests
import trollius

def get_metadata_api_urls():
    """Retrieves the api endpoints for metadata"""

    full_urls = {}
    metadata_url = "http://169.254.169.254/latest/meta-data/"
    resp = requests.get(metadata_url)
    urls = resp.content.split()
    for url in urls:
        stripped_url = url.rstrip("/")
        full_urls[stripped_url]=(os.path.join(metadata_url, url))
    return full_urls

def _get(key_url):
    key,url = key_url
    return key, requests.get(url)

def _do_calls(urls):
    loop = trollius.get_event_loop()
    futures = []
    for url in urls:
        futures.append(loop.run_in_executor(None, _get, url))
    return futures

@trollius.coroutine
def call():
    results = []
    futures = _do_calls(get_metadata_api_urls().items())
    for future in futures:
        result = yield trollius.From(future)
        results.append(result)
    raise trollius.Return(results)

def all_metadata_async():
    """Retrieves all available metadata for an instance async"""

    loop = trollius.get_event_loop()
    res = loop.run_until_complete(call())
    return res
```


### AWS Lambda and Chalice

Standalone Lambda with Chalice:  http://chalice.readthedocs.io/en/latest/

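```python
from chalice import Chalice
from slackclient import SlackClient  # slackclient 1.x API

# Hypothetical app name; SLACK_TOKEN is assumed to be defined elsewhere
app = Chalice(app_name="slack-messenger")

@app.lambda_function()
def send_message(event, context):
    """Send a message to a channel"""

    slack_client = SlackClient(SLACK_TOKEN)
    res = slack_client.api_call(
      "chat.postMessage",
      channel="#general",
      text=event
    )
    return res
```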


### Larger Scale Concurrency



*   [AWS Step Functions with Lambda](https://aws.amazon.com/step-functions/)

![alt text](https://d1.awsstatic.com/product-marketing/Step%20Functions/OrderFullScreen.0e74c2f19d89a9325addb5bd746cd895b2e4c9c2.jpg)

*   [AWS Batch](https://aws.amazon.com/batch/)

![alt text](https://d1.awsstatic.com/Test%20Images/Kate%20Test%20Images/Dilithium_flowchart%20diagrams_v3_kw-02.322877d73eda8ed71a44db216a1d195550befac0.png)

*   [RabbitMQ Worker Farms-IBM Developerworks Article](https://www.ibm.com/developerworks/cloud/library/cl-optimizepythoncloud1/index.html)

![alt text](https://www.ibm.com/developerworks/cloud/library/cl-optimizepythoncloud2/figure1.gif)





## Walking through Social Power NBA EDA and ML Project

*[Read related material covered in Chapter 6 of Pragmatic AI](https://www.safaribooksonline.com/library/view/pragmatic-ai-an/9780134863924/ch06.xhtml#ch06)*

* Data Collection Sources
* Importing and merging DataFrames in Pandas 
* Creating correlation heatmaps 
* Using seaborn lmplot 
* Using linear regression in Python
* Using ggplot in Python 
* Doing KMeans clustering 
* Doing PCA with scikit-learn 
* Doing ML classification prediction with scikit-learn 
* Doing ML Regression prediction with scikit-learn 
* Using Plotly for interactive Data Visualization



#### Data Collection Sources 

![Collection of Data](https://user-images.githubusercontent.com/58792/40758183-e64ba7c4-6440-11e8-97c5-c408e0bc321e.png)

**Twitter Code:**

https://github.com/noahgift/socialpowernba/blob/master/socialpower/sptwitter.py

**Wikipedia Code:**

https://github.com/noahgift/socialpowernba/blob/master/socialpower/spwikipedia.py

#### Import and merge DataFrames in Pandas

In [None]:
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
import seaborn as sns
color = sns.color_palette()
%matplotlib inline

In [None]:
attendance_df = pd.read_csv("../data/nba_2017_attendance.csv")
attendance_df.head()

In [None]:
endorsement_df = pd.read_csv("../data/nba_2017_endorsements.csv")
endorsement_df.head()

In [None]:
valuations_df = pd.read_csv("../data/nba_2017_team_valuations.csv")
valuations_df.head()

In [None]:
salary_df = pd.read_csv("../data/nba_2017_salary.csv")
salary_df.head()

In [None]:
pie_df = pd.read_csv("../data/nba_2017_pie.csv")
pie_df.head()

In [None]:
plus_minus_df = pd.read_csv("../data/nba_2017_real_plus_minus.csv")
plus_minus_df.head()

In [None]:
br_stats_df = pd.read_csv("../data/nba_2017_br.csv")
br_stats_df.head()

In [None]:
elo_df = pd.read_csv("../data/nba_2017_elo.csv")
elo_df.head()

In [None]:
attendance_valuation_df = attendance_df.merge(valuations_df, how="inner", on="TEAM")

In [None]:
attendance_valuation_df.head()
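
An inner merge keeps only the rows whose key appears in both DataFrames. The sketch below shows that behavior on two tiny, made-up frames; the team names and numbers are hypothetical and unrelated to the NBA files.

```python
import pandas as pd

# Hypothetical frames that share a TEAM key but only partly overlap
left = pd.DataFrame({"TEAM": ["A", "B", "C"], "attendance": [1, 2, 3]})
right = pd.DataFrame({"TEAM": ["B", "C", "D"], "value": [20, 30, 40]})

# how="inner" drops A (missing on the right) and D (missing on the left)
merged = left.merge(right, how="inner", on="TEAM")
print(merged)
```

Comparing `len(merged)` with the lengths of the inputs is a quick way to notice teams that failed to match, for example because of spelling differences in the key column.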

#### Understand correlation heatmaps and pairplots

Exploratory Data Analysis and Feature Engineering


In [None]:
sns.pairplot(attendance_valuation_df, hue="TEAM")

**Correlation Heatmap**

In [None]:
# numeric_only (pandas 1.5+) skips non-numeric columns such as TEAM
corr = attendance_valuation_df.corr(numeric_only=True)
sns.heatmap(corr, 
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values)

**Correlation DataFrame Output**

In [None]:
corr

**Creating a Pivot Table Based Heatmap in Seaborn**

A few patterns are detected:  Look at the *three highest valued signals*

In [None]:
valuations = attendance_valuation_df.pivot(index="TEAM", columns="TOTAL_MILLIONS", values="VALUE_MILLIONS")

In [None]:
fig, ax = plt.subplots(figsize=(20,15))
ax.set_title("NBA Team AVG Attendance vs Valuation in Millions:  2016-2017 Season")
sns.heatmap(valuations, linewidths=.5, annot=True, fmt='g', ax=ax)

#### Using linear regression in Python

There is a signal here: attendance and valuation do seem to be related, but the residuals look non-uniform.

In [None]:
results = smf.ols('VALUE_MILLIONS ~TOTAL_MILLIONS', data=attendance_valuation_df).fit()

In [None]:
print(results.summary())

In [None]:
sns.residplot(y="VALUE_MILLIONS", x="TOTAL_MILLIONS", data=attendance_valuation_df)

In [None]:
attendance_valuation_predictions_df = attendance_valuation_df.copy()

In [None]:
attendance_valuation_predictions_df["predicted"] = results.predict()

#### Use seaborn lmplot to plot predicted vs actual values

In [None]:
sns.lmplot(x="predicted", y="VALUE_MILLIONS", data=attendance_valuation_predictions_df)

##### Generating an RMSE (Root Mean Squared Error) for the Prediction

In [None]:
from statsmodels.tools import eval_measures
rmse = eval_measures.rmse(attendance_valuation_predictions_df["predicted"], attendance_valuation_predictions_df["VALUE_MILLIONS"])
rmse
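
To make the metric concrete, here is a small pure-Python sketch of the arithmetic behind RMSE: square each error, average the squares, then take the square root. The numbers are made up and do not come from the NBA data; `rmse_by_hand` is a hypothetical helper name.

```python
import math

def rmse_by_hand(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up values for illustration only
example_predicted = [2.0, 4.0, 6.0]
example_actual = [1.0, 4.0, 8.0]
print(rmse_by_hand(example_predicted, example_actual))
```

The result is in the same units as the target, so for the model above it would read as a typical miss in millions of dollars of valuation.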

#### Adding ELO (Strength of Schedule Ranking) to DataFrame

In [None]:
attendance_valuation_elo_df = attendance_valuation_df.merge(elo_df, how="inner", on="TEAM")

In [None]:
attendance_valuation_elo_df.head()

In [None]:
# numeric_only (pandas 1.5+) skips non-numeric columns such as TEAM and CONF
corr_elo = attendance_valuation_elo_df.corr(numeric_only=True)
fig, ax = plt.subplots(figsize=(20,15))
ax.set_title("NBA Team Correlation Heatmap:  2016-2017 Season (ELO, AVG Attendance, VALUATION IN MILLIONS)")
sns.heatmap(corr_elo,
            xticklabels=corr_elo.columns.values,
            yticklabels=corr_elo.columns.values,
            ax=ax)

In [None]:
corr_elo

In [None]:
# seaborn 0.9 renamed lmplot's `size` argument to `height`
ax = sns.lmplot(x="ELO", y="TOTAL_MILLIONS", data=attendance_valuation_elo_df, hue="CONF", height=12)
ax.set(xlabel='ELO Score', ylabel='TOTAL ATTENDANCE IN MILLIONS', title="NBA Team AVG Attendance vs ELO Ranking:  2016-2017 Season")

In [None]:
attendance_valuation_elo_df.groupby("CONF")["ELO"].median()


In [None]:
attendance_valuation_elo_df.groupby("CONF")["TOTAL_MILLIONS"].median()

In [None]:
results = smf.ols('TOTAL_MILLIONS ~ELO', data=attendance_valuation_elo_df).fit()


In [None]:
print(results.summary())
      


In [None]:
val_housing_win_df = pd.read_csv("../data/nba_2017_att_val_elo_win_housing.csv");val_housing_win_df.head()

In [None]:
val_housing_win_df.columns

In [None]:
results = smf.ols('VALUE_MILLIONS ~COUNTY_POPULATION_MILLIONS+TOTAL_ATTENDANCE_MILLIONS+MEDIAN_HOME_PRICE_COUNTY_MILLIONS', data=val_housing_win_df).fit()
print(results.summary())

#### Using ggplot in Python

In [None]:
!pip install ggplot

In [None]:
from ggplot import *
ggplot(val_housing_win_df, aes(x="TOTAL_ATTENDANCE_MILLIONS", y="VALUE_MILLIONS",
                               color="WINNING_SEASON")) + geom_point(size=400)

#### Use k-means clustering

**Unsupervised Machine Learning**

*   Unlabeled Data
*   "Discovers" Labels
*   Finds Hidden Patterns


**References:**



1.  [Pragmatic AI](https://www.safaribooksonline.com/library/view/pragmatic-ai-an/9780134863924/ch06.html#ch06) 
2.   [Python Machine Learning](https://www.safaribooksonline.com/library/view/Python+Machine+Learning+-+Second+Edition/9781787125933/ch11.html#ch11lvl2sec114)




*NBA Season Faceted Cluster Plot*

![Discovering Clusters in the NBA](https://user-images.githubusercontent.com/58792/40759110-6a93a2f8-6445-11e8-980b-ecbb1a2cc029.png)

**Data Preparation for Clustering**

* Clustering on four columns:  Attendance, ELO, Valuation and Median Home Prices
* Scaling the data


In [None]:
numerical_df = val_housing_win_df.loc[:,["TOTAL_ATTENDANCE_MILLIONS", "ELO", "VALUE_MILLIONS", "MEDIAN_HOME_PRICE_COUNTY_MILLIONS"]]

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
print(scaler.fit(numerical_df))
print(scaler.transform(numerical_df))
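
Min-max scaling maps each column onto the range 0 to 1 using `(x - min) / (max - min)`, which keeps a large-valued column such as ELO from dominating the distance calculation in k-means. A pure-Python sketch of that arithmetic, using made-up values and a hypothetical helper name, and assuming the column is not constant:

```python
def min_max_scale(values):
    """Rescale a sequence to [0, 1]; assumes max(values) > min(values)."""
    lowest, highest = min(values), max(values)
    return [(v - lowest) / (highest - lowest) for v in values]

# Made-up ratings, not taken from the NBA data set
example_ratings = [1400, 1500, 1700]
scaled_ratings = min_max_scale(example_ratings)
print(scaled_ratings)
```

`MinMaxScaler` applies the same formula column by column and remembers each column's minimum and maximum, so the identical transform can be reapplied to new data.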

In [None]:
from sklearn.cluster import KMeans
k_means = KMeans(n_clusters=3)
kmeans = k_means.fit(scaler.transform(numerical_df))
val_housing_win_df['cluster'] = kmeans.labels_
val_housing_win_df.head()

**2D Cluster Plot**

In [None]:
from ggplot import *
ggplot(val_housing_win_df, aes(x="TOTAL_ATTENDANCE_MILLIONS", y="VALUE_MILLIONS", color="cluster")) +\
geom_point(size=400) + scale_color_gradient(low = 'red', high = 'blue')

The elbow method shows that 3 clusters is a decent choice

In [None]:
distortions = []
for i in range(1, 11):
    km = KMeans(n_clusters=i,
            init='k-means++',
            n_init=10,
            max_iter=300,
            random_state=0)
    km.fit(scaler.transform(numerical_df))
    distortions.append(km.inertia_)
    
plt.plot(range(1,11), distortions, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Distortion')
plt.title("Team Valuation Elbow Method Cluster Analysis")
plt.show()

##### Silhouette Plot



In [None]:
km = KMeans(n_clusters=3,
            init='k-means++',
            n_init=10,
            max_iter=300,
            random_state=0)
y_km = km.fit_predict(scaler.transform(numerical_df))

In [None]:
import numpy as np
from matplotlib import cm
from sklearn.metrics import silhouette_samples
cluster_labels = np.unique(y_km)
n_clusters = cluster_labels.shape[0]
silhouette_vals = silhouette_samples(scaler.transform(numerical_df),
                                     y_km,
                                     metric='euclidean')
# Size the figure before drawing; calling plt.figure afterwards opens a new, empty figure
plt.figure(figsize=(20,10))
y_ax_lower, y_ax_upper = 0, 0
yticks = []
for i, c in enumerate(cluster_labels):
    c_silhouette_vals = silhouette_vals[y_km == c]
    c_silhouette_vals.sort()
    y_ax_upper += len(c_silhouette_vals)
    color = cm.jet(float(i)/n_clusters)
    plt.barh(range(y_ax_lower, y_ax_upper), c_silhouette_vals, height=1.0, edgecolor='none', color=color)
    yticks.append((y_ax_lower + y_ax_upper)/2)
    y_ax_lower += len(c_silhouette_vals)
silhouette_avg = np.mean(silhouette_vals)
plt.axvline(silhouette_avg,
            color="red",
            linestyle="--")
plt.yticks(yticks, cluster_labels + 1)
plt.ylabel('Cluster')
plt.xlabel('Silhouette coefficient')
plt.title('Silhouette Plot Team Valuation')
plt.show()

##### Agglomerative clustering (Hierarchical) vs KMeans clustering


In [None]:
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
km = KMeans(n_clusters=2,
            random_state=0)
X = scaler.transform(numerical_df)
y_km = km.fit_predict(X)
ax1.scatter(X[y_km==0,0],
            X[y_km==0,1],
            c='lightblue',
            edgecolor='black',
            marker='o',
            s=40,
            label='cluster 1')
ax1.scatter(X[y_km==1,0],
            X[y_km==1,1],
            c='red',
            edgecolor='black',
            marker='s',
            s=40,
            label='cluster 2')
ax1.set_title('NBA Team K-means clustering')
from sklearn.cluster import AgglomerativeClustering

# Euclidean distance is the default, so it is not passed explicitly;
# X is reused from the k-means step above
ac = AgglomerativeClustering(n_clusters=2,
                             linkage='complete')
y_ac = ac.fit_predict(X)
ax2.scatter(X[y_ac==0,0],
            X[y_ac==0,1],
            c='lightblue',
            edgecolor='black',
            marker='o',
            s=40,
            label='cluster 1')
ax2.scatter(X[y_ac==1,0],
            X[y_ac==1,1],
            c='red',
            edgecolor='black',
            marker='s',
            s=40,
            label='cluster 2')
ax2.set_title('NBA Team Agglomerative clustering')
plt.legend()
plt.show()

##### 3D Plot in R

![Valuation 3D Plot](https://user-images.githubusercontent.com/58792/36056809-7f87a266-0dbc-11e8-8877-9bb87905adbd.png)

Source Code:  https://github.com/noahgift/socialpowernba/blob/master/plot_team_cluster.R

#### Use PCA with sklearn

References:



1.  [PCA sklearn](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html)





In [None]:
import pandas as pd
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(numerical_df)
X = pca.transform(numerical_df)
print(f"Before PCA Reduction {numerical_df.shape}")
print(f"After PCA Reduction {X.shape}")

##### Simple Scatter Plot of Reduced Dimensions

In [None]:
plt.scatter(X[:, 0], X[:, 1])
plt.show()


#### Use ML classification prediction with scikit-learn

Create a supervised classification prediction

In [None]:
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit?
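
The cell above only opens the help for `fit`. As a minimal sketch of the full fit-and-predict cycle, the example below trains on made-up one-feature data with two well-separated classes; none of these values come from the NBA data set.

```python
from sklearn.neighbors import KNeighborsClassifier

# Made-up training data: class 0 clusters near 0, class 1 clusters near 1
X_train = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]]
y_train = [0, 0, 0, 1, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Each new point takes the majority class of its 3 nearest training points
predictions = knn.predict([[0.05], [1.15]])
print(predictions)
```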

#### ML Regression prediction with scikit-learn

Create a supervised regression prediction

In [None]:
from sklearn.neighbors import KNeighborsRegressor
neigh = KNeighborsRegressor(n_neighbors=2)
neigh.fit?
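
A matching sketch for regression, again on made-up data: with the default uniform weights, `KNeighborsRegressor` predicts the mean target of the nearest training points.

```python
from sklearn.neighbors import KNeighborsRegressor

# Made-up training data where the target is ten times the feature
X_reg_train = [[1.0], [2.0], [3.0], [4.0]]
y_reg_train = [10.0, 20.0, 30.0, 40.0]

knn_reg = KNeighborsRegressor(n_neighbors=2)
knn_reg.fit(X_reg_train, y_reg_train)

# The two nearest neighbors of 2.4 are 2.0 and 3.0, so the estimate is their mean target
estimate = knn_reg.predict([[2.4]])
print(estimate)
```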


#### Using Plotly for interactive Data Visualization

*[Read related material covered in Chapter 10 of Pragmatic AI](https://www.safaribooksonline.com/library/view/pragmatic-ai-an/9780134863924/ch10.xhtml#ch10)*

Cell configuration to set up Plotly.

Further documentation is available from [Google on Plotly Colab Integration](https://colab.research.google.com/notebooks/charts.ipynb#scrollTo=YVhMPxwa-wmS)



In [None]:
def configure_plotly_browser_state():
  import IPython
  display(IPython.core.display.HTML('''
        <script src="/static/components/requirejs/require.js"></script>
        <script>
          requirejs.config({
            paths: {
              base: '/static/base',
              plotly: 'https://cdn.plot.ly/plotly-1.5.1.min.js?noext',
            },
          });
        </script>
        '''))


##### Going Further with Real Estate Exploration

In [None]:
import pandas as pd
pd.set_option('display.float_format', lambda x: '%.3f' % x)
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
import seaborn as sns; sns.set(color_codes=True)
from sklearn.cluster import KMeans
color = sns.color_palette()
%matplotlib inline

In [None]:
df = pd.read_csv("../data/Zip_Zhvi_SingleFamilyResidence.csv")

In [None]:
df.describe()

**Clean Up DataFrame**

Rename RegionName to ZipCode and change ZipCode to a string



In [None]:
df.rename(columns={"RegionName":"ZipCode"}, inplace=True)
df["ZipCode"]=df["ZipCode"].map(lambda x: "{:.0f}".format(x))
df["RegionID"]=df["RegionID"].map(lambda x: "{:.0f}".format(x))
df.head()

In [None]:
# numeric_only skips the string columns (ZipCode, City, State, ...)
median_prices = df.median(numeric_only=True)

In [None]:
median_prices.tail()

In [None]:
marin_df = df[df["CountyName"] == "Marin"].median(numeric_only=True)
sf_df = df[df["City"] == "San Francisco"].median(numeric_only=True)
palo_alto = df[df["City"] == "Palo Alto"].median(numeric_only=True)
df_comparison = pd.concat([marin_df, sf_df, palo_alto, median_prices], axis=1)
df_comparison.columns = ["Marin County", "San Francisco", "Palo Alto", "Median USA"]

Install **Cufflinks**

In [None]:
#!pip install cufflinks

**Plotly visualization**

[Shortcut view of plot if slow to load](http://nbviewer.jupyter.org/github/noahgift/real_estate_ml/blob/648361ce7392a0af29ce79780e6e5159c1a378e9/notebooks/explore_zillow_data_sets.ipynb)

In [None]:
import cufflinks as cf
cf.go_offline()

from plotly.offline import init_notebook_mode
configure_plotly_browser_state()
init_notebook_mode(connected=False)


df_comparison.iplot(title="Bay Area Median Single Family Home Prices 1996-2017",
                    xTitle="Year",
                    yTitle="Sales Price",
                   #bestfit=True, bestfit_colors=["pink"],
                   #subplots=True,
                   shape=(4,1),
                    #subplot_titles=True,
                    fill=True,)

**Cluster on Size Rank and Price**

In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
columns_to_drop = ['RegionID', 'ZipCode', 'City', 'State', 'Metro', 'CountyName']
df_numerical = df.dropna()
df_numerical = df_numerical.drop(columns_to_drop, axis=1)

In [None]:
df_numerical.describe()

In [None]:
scaler = MinMaxScaler()
scaled_df = scaler.fit_transform(df_numerical)
kmeans = KMeans(n_clusters=3, random_state=0).fit(scaled_df)
print(len(kmeans.labels_))

In [None]:
cluster_df = df.copy(deep=True)
cluster_df.dropna(inplace=True)
cluster_df.describe()
cluster_df['cluster'] = kmeans.labels_
cluster_df['appreciation_ratio'] = round(cluster_df["2017-09"]/cluster_df["1996-04"],2)
cluster_df['CityZipCodeAppRatio'] = cluster_df['City'].map(str) + "-" + cluster_df['ZipCode'] + "-" + cluster_df["appreciation_ratio"].map(str)
cluster_df.head()

**Create a 3D Plot**

[Shortcut view of plot if slow to load](http://nbviewer.jupyter.org/github/noahgift/real_estate_ml/blob/648361ce7392a0af29ce79780e6e5159c1a378e9/notebooks/explore_zillow_data_sets.ipynb)


In [None]:
import plotly.offline as py
import plotly.graph_objs as go

from plotly.offline import init_notebook_mode
configure_plotly_browser_state()
init_notebook_mode(connected=False)

trace1 = go.Scatter3d(
    x=cluster_df["appreciation_ratio"],
    y=cluster_df["1996-04"],
    z=cluster_df["2017-09"],
    mode='markers',
    text=cluster_df["CityZipCodeAppRatio"],
    marker=dict(
        size=12,
        color=cluster_df["cluster"],                # set color to an array/list of desired values
        colorscale='Viridis',   # choose a colorscale
        opacity=0.8
    )
)
#print(trace1)
data = [trace1]
layout = go.Layout(
    showlegend=False,
    title="30 Year History USA Real Estate Prices (Clusters Colored)",
    scene = dict(
        xaxis = dict(title='X: Appreciation Ratio'),
        yaxis = dict(title="Y:  1996 Prices"),
        zaxis = dict(title="Z:  2017 Prices"),
    ),
    width=1000,
    height=900,
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='3d-scatter-colorscale')

**NBA Player Endorsements Interactive Plotly Graph**

Reference:

*   https://plot.ly/~ngift/17/






In [None]:
from plotly.offline import init_notebook_mode
configure_plotly_browser_state()
init_notebook_mode(connected=False)


import plotly.offline as py
from plotly.graph_objs import *
trace1 = {
  "x": ["LeBron James", "Kevin Durant", "James Harden", "Russell Westbrook", "Carmelo Anthony", "Dwyane Wade", "Chris Paul", "Derrick Rose", "Kyrie Irving", "Stephen Curry"], 
  "y": [55, 36, 20, 15, 8, 13, 8, 14, 13, 35], 
  "name": "Endorsements in Millions", 
  "type": "bar", 
  "uid": "df2707", 
  "xsrc": "ngift:16:53adec", 
  "ysrc": "ngift:16:0e0504"
}
trace2 = {
  "x": ["LeBron James", "Kevin Durant", "James Harden", "Russell Westbrook", "Carmelo Anthony", "Dwyane Wade", "Chris Paul", "Derrick Rose", "Kyrie Irving", "Stephen Curry"], 
  "y": [14.7, 6.29, 3.28, 4.28, 3.77, 4.67, 2.69, 3.27, 4.8, 17.57], 
  "name": "Wikipedia Pageviews", 
  "type": "bar", 
  "uid": "c9d073", 
  "xsrc": "ngift:16:53adec", 
  "ysrc": "ngift:16:fea27a"
}
trace3 = {
  "x": ["LeBron James", "Kevin Durant", "James Harden", "Russell Westbrook", "Carmelo Anthony", "Dwyane Wade", "Chris Paul", "Derrick Rose", "Kyrie Irving", "Stephen Curry"], 
  "y": [20.43, 12.24, 15.54, 17.34, 5.26, 2.52, 13.48, 1.17, 8.28, 18.8], 
  "name": "Wins Attributed to Player", 
  "type": "bar", 
  "uid": "cfe1ac", 
  "xsrc": "ngift:16:53adec", 
  "ysrc": "ngift:16:f3c87e"
}
trace4 = {
  "x": ["LeBron James", "Kevin Durant", "James Harden", "Russell Westbrook", "Carmelo Anthony", "Dwyane Wade", "Chris Paul", "Derrick Rose", "Kyrie Irving", "Stephen Curry"], 
  "y": [30.96, 26.5, 26.5, 26.5, 24.56, 23.2, 22.87, 21.32, 17.64, 12.11], 
  "name": "Salary in Millions", 
  "type": "bar", 
  "uid": "f83635", 
  "xsrc": "ngift:16:53adec", 
  "ysrc": "ngift:16:2cdf3e"
}
trace5 = {
  "x": ["LeBron James", "Kevin Durant", "James Harden", "Russell Westbrook", "Carmelo Anthony", "Dwyane Wade", "Chris Paul", "Derrick Rose", "Kyrie Irving", "Stephen Curry"], 
  "y": [5.53, 1.43, 0.97, 2.13, 0.72, 0.35, 0.83, 1.86, 1.54, 12.28], 
  "name": "Twitter Favorite Count/1000", 
  "type": "bar", 
  "uid": "9d1aad", 
  "xsrc": "ngift:16:53adec", 
  "ysrc": "ngift:16:191da9"
}
# A plain list of traces works here; the Data wrapper is deprecated
data = [trace1, trace2, trace3, trace4, trace5]
layout = {
  "barmode": "group", 
  "title": "2016-2017 NBA Season Endorsement and Social Power", 
  "xaxis": {
    "autorange": True, 
    "range": [-0.5, 9.5], 
    "type": "category"
  }, 
  "yaxis": {
    "autorange": True, 
    "range": [0, 57.8947368421], 
    "type": "linear"
  }
}
fig = Figure(data=data, layout=layout)
py.iplot(fig, filename='3d-scatter-colorscale')