# 1. Setting Up

We begin by importing the libraries we need: (1) Numpy (commonly referred to as `np`), and (2) Pandas (commonly referred to as `pd`).

In [2]:
import numpy as np
import pandas as pd

# 2. Introduction to Numpy

`numpy` is a popular Python numerical processing library.

`numpy`'s primary data structure is the array (`numpy.ndarray`, usually created with `np.array`). An array stores a sequence of values *of the same type*.
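
To make the "same type" rule concrete, here is a minimal sketch. The variable names `mixed` and `ints` are illustrative and not part of the original notebook. When the input values have different numeric types, NumPy picks a single type that can hold all of them.

```python
import numpy as np

# Mixing integers and a float: every element is stored as a float
mixed = np.array([1, 2, 3.5])
print(mixed)
print(mixed.dtype)  # float64

# Only integers: the array keeps an integer dtype
# (the exact integer width can depend on the platform)
ints = np.array([1, 2, 3])
print(ints.dtype)
```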

## 2a: More Than One Dimension

In machine learning, we rarely store data in one-dimensional arrays. Typically, we store data in 2D arrays, where each row is a datapoint, and each column represents an attribute of the datapoint (more on this in the `pandas` section later). Numpy arrays are great for representing multi-dimensional data efficiently.

In [3]:
# np.ones actually takes a tuple, specifying the rows and columns of the all ones matrix (2D array)
x = np.ones((3,4))
print(x)

[[1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]]


The `reshape` function allows us to take an array and change its shape while maintaining its data.

In [4]:
# Create an array of the values 0 to 20 (exclusive)
x = np.arange(20)
print('Before reshape')
print(x)
print()

Before reshape
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19]



In [5]:
# Reshape the array such that it has dimensions 5x4 (5 rows, 4 columns)
y = x.reshape((5,4))
print('After reshape')
print(y)

After reshape
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]
 [16 17 18 19]]


In [6]:
# What happens if we reshape to a different number of entries?

# Fewer entries
z = x.reshape((6,3))
print(z)

ValueError: cannot reshape array of size 20 into shape (6,3)
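
The error above comes down to a size check: a reshape only succeeds when the product of the new dimensions equals the number of elements. The sketch below is illustrative (the `fits` dictionary is a helper introduced here, not something from the notebook) and also shows one way to catch the error instead of letting the cell fail.

```python
import numpy as np

x = np.arange(20)  # 20 elements

# A shape fits only if rows * columns equals the number of elements
fits = {}
for rows, cols in [(5, 4), (6, 3), (2, 10)]:
    fits[(rows, cols)] = (rows * cols == x.size)
    print((rows, cols), 'fits' if fits[(rows, cols)] else 'does not fit')

# Catch the error rather than stopping execution
try:
    x.reshape((6, 3))
except ValueError as err:
    print('ValueError:', err)
```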

Create an array that contains the numbers 0-24.

In [8]:
arr = np.arange(25)

Reshape the array to be 5x5.

In [9]:
arr.reshape(5, 5)

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24]])

## 2b: Accessing Data

How do we access data at a particular location (e.g., a particular row and column) in an array? This process is referred to as **"indexing"**. If you are selecting multiple rows or columns, it is referred to as **"slicing"**.

In [10]:
# What will this cell output? 

# Access one value
print('First: y[1,2]')
y1 = y[1,2]
print(y1)
print()

First: y[1,2]
6



In [11]:
# Use slice notation [a:b, c:d]; in each start:stop pair, start is inclusive and stop is exclusive

# Slice first element of tuple
print('Second: y[3:5, 1]')
y2 = y[3:5, 1]
print(y2)
print()

Second: y[3:5, 1]
[13 17]



In [12]:
# Slice second element of tuple
print('Third: y[3, 1:3]')
y3 = y[3, 1:3]
print(y3)
print()

Third: y[3, 1:3]
[13 14]



In [13]:
# Slice both elements of tuple
print('Fourth: y[0:5, 2:5]')
y4 = y[0:5, 2:5]
print(y4)

Fourth: y[0:5, 2:5]
[[ 2  3]
 [ 6  7]
 [10 11]
 [14 15]
 [18 19]]


In [14]:
# Slice notation has an "and everything else" syntax
print('Fifth: y[1, :]')
y5 = y[1, :]
print(y5) # Everything in the row at index 1 (the second row)
print()

Fifth: y[1, :]
[4 5 6 7]



In [15]:
print('Sixth: y[:, 3]')
y6 = y[:, 3]
print(y6) # Everything in the column at index 3 (the fourth column)
print()


Sixth: y[:, 3]
[ 3  7 11 15 19]



Access the elements in the row at index 4 of `y` (the fifth row), from the first two columns.

In [21]:
print(y[4,:2])

[16 17]


## 2c: Shape of Numpy Arrays

We often need to know how many datapoints are in our dataset (num rows), or how many attributes there are per point (num columns). This is referred to as the numpy.array's "shape."

In [22]:
# What does this return? How do you interpret the result?

print('y.shape')
print(y.shape)
print()

y.shape
(5, 4)



In [23]:
# What happens when we reshape the y4 array?

print('y4')
print(y4)
print()

print('y4.shape')
print(y4.shape)
print()


# Reshape it
y4_r = y4.reshape(2,5)

print('y4_r')
print(y4_r)
print()

print('y4_r.shape')
print(y4_r.shape)
print()

y4
[[ 2  3]
 [ 6  7]
 [10 11]
 [14 15]
 [18 19]]

y4.shape
(5, 2)

y4_r
[[ 2  3  6  7 10]
 [11 14 15 18 19]]

y4_r.shape
(2, 5)



Note, this is *not* the same as the transpose operation! When we reshape we maintain the order of the elements, left to right and top to bottom.
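
A short side-by-side comparison may help here. This sketch rebuilds `y4` by hand so that it runs on its own; `reshaped` and `transposed` are names introduced for illustration. Both results have shape (2, 5), but the elements land in different places.

```python
import numpy as np

# Same values as y4 above
y4 = np.array([[2, 3], [6, 7], [10, 11], [14, 15], [18, 19]])

reshaped = y4.reshape(2, 5)  # keeps the reading order, changes the row length
transposed = y4.T            # turns rows into columns

print(reshaped)
print(transposed)
print(np.array_equal(reshaped, transposed))  # False
```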

## 2d: Numpy Functions

Numpy has functions that can be applied to arrays and their subsets! Many of the standard functions we might want to use are supported.
- mean()
- max()
- min()

In [24]:
# Reusing y from above (values 0 to 20 exclusive, in a 5x4 array)
print('y')
print(y)
print()

print('Mean y')
print(np.mean(y))
print()

print('min of the second column of y')
print(np.min(y[:,1]))
print()

# Technically, we could have also used our knowledge of the data to answer this question without computation.
# We know how the data is distributed across the array; in particular, elements increase left to right and top to bottom.
# Leveraging this knowledge would save us computation in situations with vast, many dimensional arrays.

y
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]
 [16 17 18 19]]

Mean y
9.5

min of the second column of y
1



You can also apply a function along a particular axis.

In [25]:
# Another syntax for numpy functions across arrays
print(np.max(y, axis=0))

[16 17 18 19]


In [26]:
print(np.max(y, axis=1))

# What are we returning here?

[ 3  7 11 15 19]


With `axis=0`, the function is applied down the rows, producing one result per column; with `axis=1`, it is applied across the columns, producing one result per row.
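
One way to check which axis does what is to look at the shape of the result. The sketch below rebuilds `y` so that it is self-contained; `col_max` and `row_max` are illustrative names.

```python
import numpy as np

y = np.arange(20).reshape((5, 4))  # 5 rows, 4 columns

col_max = np.max(y, axis=0)  # collapses the rows: one value per column
row_max = np.max(y, axis=1)  # collapses the columns: one value per row

print(col_max.shape)  # (4,)
print(row_max.shape)  # (5,)
```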

How do you find the numpy functions you need? Search Google for "numpy [function description]." Numpy has very useful documentation and examples that can help you understand its functions!

# 3. Introduction to Pandas

Pandas is a library for working with tabular data, which it represents as dataframes.

Imagine you're storing a dataset that consists of average home price and the crime rate for neighborhoods near Philadelphia. You could use a Numpy array where you always store the home price in column 1, the crime rate in column 2, etc. But it becomes ***very difficult*** to remember which column has what data, and it makes it hard for anyone else to understand your code. 

To overcome these challenges, we use Pandas dataframes, which let us label each column with a description of the data it contains.
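
The contrast can be sketched with three rows copied from the dataset output shown further down. The names `values` and `df` are illustrative. With a bare array you have to remember what column 1 means; with a dataframe the label says it.

```python
import numpy as np
import pandas as pd

# House price and crime rate for three neighborhoods
values = np.array([[140463, 29.7], [113033, 24.1], [124186, 19.5]])
print(values[:, 1])  # which attribute is column 1?

# The same data with labeled columns
df = pd.DataFrame(values, columns=['HousePrice', 'CrimeRate'])
print(df['CrimeRate'])
```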

First, let us load the dataset.

In [27]:
crime = pd.read_csv('Philadelphia_Crime_Rate_noNA.csv')

Let's begin by seeing what our dataset looks like:

In [28]:
crime

Unnamed: 0,HousePrice,"HsPrc ($10,000)",CrimeRate,MilesPhila,PopChg,Name,County
0,140463,14.0463,29.7,10.0,-1.0,Abington,Montgome
1,113033,11.3033,24.1,18.0,4.0,Ambler,Montgome
2,124186,12.4186,19.5,25.0,8.0,Aston,Delaware
3,110490,11.0490,49.4,25.0,2.7,Bensalem,Bucks
4,79124,7.9124,54.1,19.0,3.9,Bristol B.,Bucks
...,...,...,...,...,...,...,...
94,174232,17.4232,13.8,25.0,4.7,Westtown,Chester
95,196515,19.6515,29.9,16.0,1.8,Whitemarsh,Montgome
96,232714,23.2714,9.9,21.0,0.2,Willistown,Chester
97,245920,24.5920,22.6,10.0,0.3,Wynnewood,Montgome


If the dataset is particularly large, we can look at only the first few rows using the `head` method.

In [29]:
crime.head()

Unnamed: 0,HousePrice,"HsPrc ($10,000)",CrimeRate,MilesPhila,PopChg,Name,County
0,140463,14.0463,29.7,10.0,-1.0,Abington,Montgome
1,113033,11.3033,24.1,18.0,4.0,Ambler,Montgome
2,124186,12.4186,19.5,25.0,8.0,Aston,Delaware
3,110490,11.049,49.4,25.0,2.7,Bensalem,Bucks
4,79124,7.9124,54.1,19.0,3.9,Bristol B.,Bucks


Select all values in the `HousePrice` column:

In [30]:
crime['HousePrice']

0     140463
1     113033
2     124186
3     110490
4      79124
       ...  
94    174232
95    196515
96    232714
97    245920
98    130953
Name: HousePrice, Length: 99, dtype: int64

Using a single column we can apply a variety of aggregate functions to summarize the data:
* min()
* max()
* mean()

In [33]:
crime['HousePrice'].min()

np.int64(28000)

In [34]:
crime['HousePrice'].max()

np.int64(475112)

In [35]:
crime['HousePrice'].mean()

np.float64(157835.60606060605)

## 3a: Accessing the Data

Different columns may have different types. Let's find out the type of each column!

In [36]:
crime.dtypes

HousePrice           int64
HsPrc ($10,000)    float64
CrimeRate          float64
MilesPhila         float64
PopChg             float64
Name                object
County              object
dtype: object

We can index into pandas dataframes and series by integer position using the `iloc` indexer.

In [37]:
crime.iloc[10]

HousePrice           134342
HsPrc ($10,000)     13.4342
CrimeRate              17.3
MilesPhila             31.0
PopChg                  4.2
Name               Chalfont
County                Bucks
Name: 10, dtype: object

In [38]:
crime['HousePrice'].iloc[10]

np.int64(134342)

In [39]:
crime.iloc[10]['HousePrice']

np.int64(134342)

Get the house price of the city at row 73 of the dataset.

In [41]:
crime['HousePrice'].iloc[73]

np.int64(259506)

## 3b: Filters

One of the most powerful features of pandas is being able to filter data based on certain criteria.

Let's start by getting all rows in our dataset that are located in Bucks County.

In [42]:
crime[crime["County"] == "Bucks"]

Unnamed: 0,HousePrice,"HsPrc ($10,000)",CrimeRate,MilesPhila,PopChg,Name,County
3,110490,11.049,49.4,25.0,2.7,Bensalem,Bucks
4,79124,7.9124,54.1,19.0,3.9,Bristol B.,Bucks
5,92634,9.2634,48.6,20.0,0.6,Bristol T.,Bucks
9,264298,26.4298,20.4,26.0,6.0,Buckingham,Bucks
10,134342,13.4342,17.3,31.0,4.2,Chalfont,Bucks
17,190317,19.0317,19.4,26.0,1.9,Doylestown,Bucks
24,114233,11.4233,29.0,30.0,1.3,Falls Town,Bucks
34,194435,19.4435,15.7,32.0,15.0,L. Makefield,Bucks
43,143072,14.3072,40.1,23.0,1.6,Middletown,Bucks
44,96769,9.6769,36.1,15.0,5.1,Morrisville,Bucks


What exactly is this doing? 
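
One way to answer that question is to split the expression into two steps. The sketch below uses a small stand-in dataframe (`towns`, with names and counties copied from the output above) rather than the full dataset. The comparison produces a series of `True`/`False` values, and indexing with that series keeps only the rows marked `True`.

```python
import pandas as pd

# Small stand-in dataframe, not the full dataset
towns = pd.DataFrame({
    'Name': ['Abington', 'Bensalem', 'Aston', 'Chalfont'],
    'County': ['Montgome', 'Bucks', 'Delaware', 'Bucks'],
})

# Step 1: the comparison gives one True/False value per row
mask = towns['County'] == 'Bucks'
print(mask)

# Step 2: indexing with the mask keeps the rows where it is True
bucks = towns[mask]
print(bucks)
```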

Filters can also be combined to apply multiple criteria. Let's look at all rows in the dataset that are located in Bucks County with a `CrimeRate` greater than 15.

In [43]:
crime[(crime["County"] == "Bucks") & (crime["CrimeRate"] > 15)]

Unnamed: 0,HousePrice,"HsPrc ($10,000)",CrimeRate,MilesPhila,PopChg,Name,County
3,110490,11.049,49.4,25.0,2.7,Bensalem,Bucks
4,79124,7.9124,54.1,19.0,3.9,Bristol B.,Bucks
5,92634,9.2634,48.6,20.0,0.6,Bristol T.,Bucks
9,264298,26.4298,20.4,26.0,6.0,Buckingham,Bucks
10,134342,13.4342,17.3,31.0,4.2,Chalfont,Bucks
17,190317,19.0317,19.4,26.0,1.9,Doylestown,Bucks
24,114233,11.4233,29.0,30.0,1.3,Falls Town,Bucks
34,194435,19.4435,15.7,32.0,15.0,L. Makefield,Bucks
43,143072,14.3072,40.1,23.0,1.6,Middletown,Bucks
44,96769,9.6769,36.1,15.0,5.1,Morrisville,Bucks


How many rows are there in this filtered dataset?

In [44]:
crime[(crime["County"] == "Bucks") & (crime["CrimeRate"] > 15)].shape

(15, 7)

Get the rows in Delaware county with home prices less than $100,000.

In [48]:
crime[(crime["County"] == "Delaware") & (crime["HousePrice"] < 100000)]

Unnamed: 0,HousePrice,"HsPrc ($10,000)",CrimeRate,MilesPhila,PopChg,Name,County
6,89246,8.9246,30.8,15.0,-2.6,Brookhaven,Delaware
12,77370,7.737,34.2,10.0,-1.2,Clifton,Delaware
14,40642,4.0642,45.7,15.0,0.0,Darby Bor.,Delaware
15,71359,7.1359,22.3,8.0,1.6,Darby Town,Delaware
25,74502,7.4502,21.4,15.0,-3.2,Follcroft,Delaware
27,97167,9.7167,29.3,10.0,0.2,Glenolden,Delaware
38,93738,9.3738,19.3,7.0,-0.4,Lansdown,Delaware
45,94014,9.4014,26.6,14.0,0.5,Morton,Delaware
54,99843,9.9843,12.5,12.0,-3.7,Norwood,Delaware
66,92215,9.2215,17.4,14.0,7.8,Prospect Park,Delaware


How many rows are there in this filtered dataset?

In [50]:
crime[(crime["County"] == "Delaware") & (crime["HousePrice"] < 100000)].shape

(12, 7)

## 3c: Sorting

Sometimes, it is easier to view a dataset after sorting all the datapoints by a particular attribute.

In [51]:
crime.sort_values(by=['HousePrice'])

Unnamed: 0,HousePrice,"HsPrc ($10,000)",CrimeRate,MilesPhila,PopChg,Name,County
56,28000,2.8000,44.9,5.5,-8.4,"Phila, N",Phila
59,38000,3.8000,54.8,4.5,-5.1,"Phila, SW",Phila
60,38000,3.8000,53.5,2.0,-9.2,"Phila, South",Phila
14,40642,4.0642,45.7,15.0,0.0,Darby Bor.,Delaware
61,42000,4.2000,69.9,4.0,-5.7,"Phila, West",Phila
...,...,...,...,...,...,...,...
79,359112,35.9112,9.4,36.0,4.0,U. Makefield,Bucks
30,389302,38.9302,17.8,20.0,1.5,Horsham,Montgome
88,436348,43.6348,22.1,15.0,1.3,Villanova,Montgome
29,436348,43.6348,16.5,10.0,-0.7,Haverford,Delaware


In [52]:
crime.sort_values(by=['CrimeRate'], ascending=False)

Unnamed: 0,HousePrice,"HsPrc ($10,000)",CrimeRate,MilesPhila,PopChg,Name,County
62,96200,9.6200,366.1,0.0,4.8,"Phila,CC",Phila
52,71981,7.1981,73.3,19.0,4.9,Norristown,Montgome
18,215512,21.5512,71.9,26.0,5.8,E. Bradford,Chester
89,124478,12.4478,71.9,22.0,4.6,W. Chester,Chester
61,42000,4.2000,69.9,4.0,-5.7,"Phila, West",Phila
...,...,...,...,...,...,...,...
71,229711,22.9711,9.8,22.0,5.3,Schuylkill,Chester
79,359112,35.9112,9.4,36.0,4.0,U. Makefield,Bucks
73,259506,25.9506,7.2,40.0,17.4,Solebury,Bucks
53,169401,16.9401,7.1,22.0,1.5,Northampton,Bucks


Sort the dataset from highest population change (the fastest growing area) to lowest.

In [55]:
crime.sort_values(by="PopChg", ascending=False)

Unnamed: 0,HousePrice,"HsPrc ($10,000)",CrimeRate,MilesPhila,PopChg,Name,County
39,121024,12.1024,39.5,35.0,26.9,Limerick,Montgome
93,152624,15.2624,24.0,19.0,23.1,Warrington,Bucks
73,259506,25.9506,7.2,40.0,17.4,Solebury,Bucks
34,194435,19.4435,15.7,32.0,15.0,L. Makefield,Bucks
91,114157,11.4157,44.6,38.0,14.6,W. Whiteland,Chester
...,...,...,...,...,...,...,...
61,42000,4.2000,69.9,4.0,-5.7,"Phila, West",Phila
58,61800,6.1800,49.9,9.0,-6.4,"Phila, NW",Phila
56,28000,2.8000,44.9,5.5,-8.4,"Phila, N",Phila
60,38000,3.8000,53.5,2.0,-9.2,"Phila, South",Phila


## 3d: Problems

**[Discussion]** What biases can exist when using a dataset and training a model to predict crime rates?

### 1)
* a) Print out all the rows in the dataset where `CrimeRate` is between 15 and 25.
* b) How many entries in the dataset match this criterion?
* c) How many unique counties appear among those entries?

For part c), I haven't given you one of the functions that you need yet. Search Google for "pandas num unique" to find a relevant function in the documentation.

### 2)
* a) Print out the average `HousePrice` in `Montgome` county.
* b) What is the highest house price in the entire dataset? What county is it located in?

### 3) (Extra)
* What is the average price per county?
* [groupby documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html)
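
As a hint for the problems above, here is a minimal sketch of counting unique values and grouping, run on a five-row stand-in dataframe (`sample`, with values copied from the first rows of the dataset) rather than on `crime` itself. It assumes `Series.nunique` is the function the search in part c) turns up; adapting it to the filtered data is left to you.

```python
import pandas as pd

# Stand-in data: the first five rows of the dataset
sample = pd.DataFrame({
    'County': ['Montgome', 'Montgome', 'Delaware', 'Bucks', 'Bucks'],
    'HousePrice': [140463, 113033, 124186, 110490, 79124],
})

# Number of distinct values in a column
n_counties = sample['County'].nunique()
print(n_counties)  # 3

# Average of one column within each group
per_county = sample.groupby('County')['HousePrice'].mean()
print(per_county)
```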

# 4. Introduction to sklearn/scikit-learn

`scikit-learn`/`sklearn` is a machine learning Python library used to implement machine learning models and perform statistical modeling.

In this example, we will use a dataset of health metrics for diabetes patients and try to **predict a quantitative measure of each patient's disease progression.**

Note that the `sklearn` datasets come pre-separated. What we mean by that is the data inputs are separate from the labels. The input data is stored in the `.data` field and the labels in the `.target` field.

First, let us load the dataset and investigate it.

In [3]:
from sklearn import datasets
diabetes = datasets.load_diabetes()

In [4]:
# Check the shape of the input data.
# 442 rows, 10 columns

diabetes.data.shape

In [5]:
# Check the shape of the target
# 442 rows (no columns, just an array)

diabetes.target.shape

In [6]:
# Names of the columns

diabetes.feature_names

In [7]:
# targets - the disease progression measure for each patient

diabetes.target

## 4a: Train a Model

To train a model, we will use sklearn's `LinearRegression` model. You can see the documentation for `LinearRegression` [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html).

How would you train a linear regression model?

From the documentation, you can see that there are lots of other functions / properties you can look at. 

Let's say you want to view the learned weights. How would you do that?
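
One possible answer to both questions, as a sketch: it fits on the full dataset with default settings and does not hold out a test set, which you would normally want. `fit`, `coef_`, and `intercept_` are described in the `LinearRegression` documentation linked above; `model` is an illustrative variable name.

```python
from sklearn import datasets
from sklearn.linear_model import LinearRegression

diabetes = datasets.load_diabetes()

# Fit the model: inputs first, labels second
model = LinearRegression()
model.fit(diabetes.data, diabetes.target)

# Learned weights: one coefficient per input column, plus an intercept
print(model.coef_)
print(model.intercept_)
```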

## 4b: Predicting values

How would you predict the target values with your model?
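
A minimal sketch of one approach, assuming a model fitted as in the previous section. It predicts on the first five training rows purely for illustration, so the comparison with the true targets says nothing about how the model would do on new data.

```python
from sklearn import datasets
from sklearn.linear_model import LinearRegression

diabetes = datasets.load_diabetes()
model = LinearRegression().fit(diabetes.data, diabetes.target)

# predict takes inputs with the same 10 columns used for training
predictions = model.predict(diabetes.data[:5])

print(predictions)          # model outputs
print(diabetes.target[:5])  # true values for the same rows
```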