# 1. Setting Up

We begin by importing the libraries we need: (1) Numpy (commonly referred to as `np`), and (2) Pandas (commonly referred to as `pd`).

In [2]:
import numpy as np
import pandas as pd

# 2. Introduction to Numpy

`numpy` is a popular Python numerical processing library.

`numpy`'s primary data structure is the array (type `numpy.ndarray`, usually created with `np.array`). An array stores a sequence of values *of the same type*.
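As a minimal sketch of the "same type" rule (the variable names are illustrative and not part of the original notebook): when a list mixes integers and floats, Numpy converts every element to one common type.

```python
import numpy as np

# Illustrative values; the names below are hypothetical.
ints = np.array([1, 2, 3])
mixed = np.array([1, 2.5, 3])

print(ints.dtype)   # an integer type (the exact width depends on the platform)
print(mixed.dtype)  # float64: the integers were converted to floats
print(mixed)
```

Checking `.dtype` is a quick way to confirm what type an array actually holds.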

## 2a: More Than One Dimension

In machine learning, we rarely store data in one-dimensional arrays. Typically, we store data in 2D arrays, where each row is a datapoint, and each column represents an attribute of the datapoint (more on this in the `pandas` section later). Numpy arrays are great for representing multi-dimensional data efficiently.

In [4]:
# np.ones actually takes a tuple, specifying the rows and columns of the all ones matrix (2D array)
x = np.ones((3,4))
print(x)

The `reshape` function allows us to take an array and change its shape while maintaining its data.

In [5]:
# Create an array of the values 0 to 20 (exclusive)
x = np.arange(20)
print('Before reshape')
print(x)
print()

In [0]:
# Reshape the array such that it has dimensions 5x4 (5 rows, 4 columns)
y = x.reshape((5,4))
print('After reshape')
print(y)

In [6]:
# What happens if we reshape to a different number of entries?

# Fewer entries
z = x.reshape((6,3))
print(z)
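The cell above is expected to fail: `x` holds 20 values, but a 6x3 array needs exactly 18. Here is a small sketch of that failure, and of asking Numpy to infer one dimension with `-1` (the name `w` is illustrative):

```python
import numpy as np

x = np.arange(20)

try:
    x.reshape((6, 3))  # 6 * 3 = 18, but x has 20 elements
except ValueError as err:
    print('reshape failed:', err)

# Passing -1 asks Numpy to infer that dimension from the total size.
w = x.reshape((4, -1))
print(w.shape)  # (4, 5)
```

The total number of entries must stay the same before and after a reshape.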

Create an array that contains the numbers 0-24.

In [0]:
arr = # something

Reshape the array to be 5x5.

## 2b: Accessing Data

How do we access data at a particular location (e.g., a particular row and column) in an array? This process is referred to as **"indexing"**. If you are selecting multiple rows or columns, it is referred to as **"slicing"**.

In [0]:
# What will this cell output? 

# Access one value
print('First: y[1,2]')
y1 = y[1,2]
print(y1)
print()

In [0]:
# Use slice notation [a:b, c:d], where pre:post has pre inclusive, post exclusive

# Slice first element of tuple
print('Second: y[3:5, 1]')
y2 = y[3:5, 1]
print(y2)
print()

In [0]:
# Slice second element of tuple
print('Third: y[3, 1:3]')
y3 = y[3, 1:3]
print(y3)
print()

In [0]:
# Slice both elements of tuple
print('Fourth: y[0:5, 2:5]')
y4 = y[0:5, 2:5]
print(y4)

In [0]:
# Slice notation has an "and everything else" syntax
print('Fifth: y[1, :]')
y5 = y[1, :]
print(y5) # Everything in row 1 (the second row, since indexing starts at 0)
print()

In [0]:
print('Sixth: y[:, 3]')
y6 = y[:, 3]
print(y6) # Everything in column 3 (the fourth column)
print()


Access the elements in the 4th row of y, from the first two columns.

## 2c: Shape of Numpy Arrays

We often need to know how many datapoints are in our dataset (num rows), or how many attributes there are per point (num columns). This is referred to as the numpy.array's "shape."

In [0]:
# What does this return? How do you interpret the result?

print('y.shape')
print(y.shape)
print()

In [0]:
# What happens when we reshape the y4 array?

print('y4')
print(y4)
print()

print('y4.shape')
print(y4.shape)
print()


# Reshape it
y4_r = y4.reshape(2,5)

print('y4_r')
print(y4_r)
print()

print('y4_r.shape')
print(y4_r.shape)
print()

Note, this is *not* the same as the transpose operation! When we reshape we maintain the order of the elements, left to right and top to bottom.
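To make the difference concrete, here is a small sketch on a 2x3 array (the array `a` is illustrative and not part of the original notebook):

```python
import numpy as np

a = np.arange(6).reshape(2, 3)  # [[0 1 2], [3 4 5]]

reshaped = a.reshape(3, 2)      # keeps reading order: [[0 1], [2 3], [4 5]]
transposed = a.T                # swaps rows and columns: [[0 3], [1 4], [2 5]]

print(reshaped)
print(transposed)
print(np.array_equal(reshaped, transposed))  # False
```

Both results have shape (3, 2), but only the transpose turns the original rows into columns.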

## 2d: Numpy Functions

Numpy has functions that can be applied to arrays and their subsets! Many of the standard functions we might want to use are supported.
- mean()
- max()
- min()

In [0]:
# Reusing y from above (the values 0 to 20, exclusive, in a 5x4 array)
print('y')
print(y)
print()

print('Mean y')
print(np.mean(y))
print()

print('min of the second column of y')
print(np.min(y[:,1]))
print()

# Technically, we could have also used our knowledge of the data to answer this question without computation.
# We know how the data is distributed across the array; in particular, elements increase left to right and top to bottom.
# Leveraging this knowledge would save us computation in situations with vast, many dimensional arrays.

You can also apply a function along a particular axis.

In [0]:
# Another syntax for numpy functions across arrays
print(np.max(y, axis=0))

In [0]:
print(np.max(y, axis=1))

# What are we returning here?

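`axis=0` runs down the rows, so the function is applied within each column and returns one value per column. `axis=1` runs across the columns, so the function is applied within each row and returns one value per row.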
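A short sketch of both axes on the same 5x4 array used earlier (the names `col_max` and `row_max` are illustrative):

```python
import numpy as np

y = np.arange(20).reshape((5, 4))

col_max = np.max(y, axis=0)  # collapse the rows: one maximum per column
row_max = np.max(y, axis=1)  # collapse the columns: one maximum per row

print(col_max)  # values 16, 17, 18, 19 (4 entries, one per column)
print(row_max)  # values 3, 7, 11, 15, 19 (5 entries, one per row)
```

A useful check is the output shape: the axis you pass is the one that disappears.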

How do you find the numpy functions you need? Search Google for "numpy [function description]." Numpy has very useful documentation and examples that can help you understand its functions!

# 3. Introduction to Pandas

Pandas is a Python library for working with tabular data. Its central data structure is the dataframe.

Imagine you're storing a dataset that consists of average home price and the crime rate for neighborhoods near Philadelphia. You could use a Numpy array where you always store the home price in column 1, the crime rate in column 2, etc. But it becomes ***very difficult*** to remember which column has what data, and it makes it hard for anyone else to understand your code. 

To overcome these challenges, we use Pandas dataframes, which let us label each column with a description of the data it contains.
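As a minimal sketch of labeled columns, here is a tiny dataframe built by hand. The numbers are made up for illustration and are not taken from the real dataset; only the column names mirror it.

```python
import pandas as pd

# Made-up numbers for illustration only; not taken from the real dataset.
toy = pd.DataFrame({
    'HousePrice': [140000, 95000, 210000],
    'CrimeRate': [29.7, 24.1, 12.3],
})

print(toy['CrimeRate'])   # select by label, no column position to remember
print(list(toy.columns))  # ['HousePrice', 'CrimeRate']
```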

First, let us load the dataset.

In [3]:
crime = pd.read_csv('Philadelphia_Crime_Rate_noNA.csv')

Let's begin by seeing what our dataset looks like:

In [3]:
crime

If the dataset is particularly large, we can look at only the first few rows using the `head` method.

In [5]:
crime.head()

Select all values in the `HousePrice` column.

In [8]:
crime['HousePrice']

Using a single column we can apply a variety of aggregate functions to summarize the data:
* min()
* max()
* mean()

In [6]:
crime['HousePrice']. # enter methods here.

In [0]:
crime['HousePrice']. # enter methods here.

In [0]:
crime['HousePrice']. # enter methods here.

## 3a: Accessing the Data

Different columns may have different types. Let's find out the type of each column!

In [0]:
crime.dtypes

We can index into pandas dataframes and series by integer position using the `iloc` indexer.

In [0]:
crime.iloc[10]

In [0]:
crime['HousePrice'].iloc[10]

In [0]:
crime.iloc[10]['HousePrice']

Get the house price of the city at row 73 of the dataset.

## 3b: Filters

One of the most powerful features of pandas is being able to filter data based on certain criteria.

Let's start by getting all rows in our dataset that are located in Bucks County.

In [1]:
crime[crime["County"] == "Bucks"]

What exactly is this doing? 
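One way to see what is happening is to split the filter into two steps. The sketch below uses a hand-made dataframe; the rows are made up, and only the column names mirror the dataset.

```python
import pandas as pd

# Made-up rows for illustration; only the column names mirror the dataset.
toy = pd.DataFrame({
    'County': ['Bucks', 'Delaware', 'Bucks'],
    'CrimeRate': [12.0, 30.5, 18.2],
})

mask = toy['County'] == 'Bucks'  # boolean Series: True, False, True
print(mask)

bucks = toy[mask]                # keeps only the rows where mask is True
print(bucks)
```

The comparison produces one True/False value per row, and indexing the dataframe with that Series keeps the True rows.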

Filters can also be combined to apply multiple criteria. Let's look at all rows in the dataset that are located in Bucks County and have a crime rate greater than 15.

In [0]:
crime[(crime["County"] == "Bucks") & (crime["CrimeRate"] > 15)]

How many rows are there in this dataset?

In [0]:
crime[(crime["County"] == "Bucks") & (crime["CrimeRate"] > 15)].shape

Get the rows in Delaware county with home prices less than $100,000.

How many rows are there in this dataset?

## 3c: Sorting

Sometimes, it is easier to view a dataset after sorting all the datapoints by a particular attribute.

In [1]:
crime.sort_values(by=['HousePrice'])

In [0]:
crime.sort_values(by=['CrimeRate'], ascending=False)
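As a small sketch on made-up prices (not from the real dataset), note that `sort_values` returns a new, sorted dataframe and leaves the original untouched:

```python
import pandas as pd

# Made-up prices for illustration only.
toy = pd.DataFrame({'HousePrice': [140000, 95000, 210000]})

by_price = toy.sort_values(by=['HousePrice'], ascending=False)

print(by_price)  # highest price first; the original row labels are kept
print(toy)       # the original frame is still in its original order
```

To keep the sorted result, assign it to a variable as shown here.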

Sort the dataset from highest population change (the fastest growing area) to least.

## 3d: Problems

**[Discussion]** What biases can exist when using a dataset and training a model to predict crime rates?

### 1)
* a) Print out all the rows in the dataset where the crime rate is between 15 and 25 percent.
* b) How many entries in the dataset meet these criteria?
* c) How many unique counties appear among those entries?

For part c), I haven't given you one of the functions that you need yet. Search Google for "pandas num unique" to find a relevant function in the documentation.

### 2)
* a) Print out the average house price in Montgome county.
* b) What is the most expensive house in the entire dataset? What county is it located in?

### 3) (Extra)
What is the average house price per county?
* [groupby documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html)

# 4. Introduction to sklearn/scikit-learn

`scikit-learn`/`sklearn` is a machine learning Python library used to implement machine learning models and perform statistical modeling.

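In this example, we will use a diabetes dataset that contains various health metrics and try to **predict a person's blood sugar level.** Strictly speaking, scikit-learn's documentation describes this dataset's target as a quantitative measure of disease progression one year after baseline, so read "blood sugar level" in this section as shorthand for that target.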

Note that the `sklearn` datasets come pre-separated. What we mean by that is the data inputs are separate from the labels. The input data is stored in the `.data` field and the labels in the `.target` field.

First, let us load the dataset and investigate it.

In [3]:
from sklearn import datasets
diabetes = datasets.load_diabetes()

In [4]:
# Check the shape of the input data.
# 442 rows, 10 columns

diabetes.data.shape

In [5]:
# Check the shape of the target
# 442 rows (no columns, just an array)

diabetes.target.shape

In [6]:
# Names of the columns

diabetes.feature_names

In [7]:
# targets - actual blood sugar levels

diabetes.target

## 4a: Train a Model

To train a model, we will use sklearn's `LinearRegression` model. You can see the documentation for `LinearRegression` [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html).

How would you train a linear regression model?

From the documentation, you can see that there are lots of other functions / properties you can look at. 

Let's say you want to view the learned weights. How would you do that?
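One possible approach to the two questions above, as a minimal sketch: it assumes we fit on the full dataset with default settings (no train/test split), and the variable name `model` is illustrative.

```python
from sklearn import datasets
from sklearn.linear_model import LinearRegression

diabetes = datasets.load_diabetes()

# Assumption: fit on all rows with default settings, for illustration only.
model = LinearRegression()
model.fit(diabetes.data, diabetes.target)  # learn weights from inputs and labels

# One learned weight per input column, plus a single intercept term.
for name, weight in zip(diabetes.feature_names, model.coef_):
    print(name, weight)
print('intercept', model.intercept_)
```

The learned weights live in the `coef_` attribute, which only exists after `fit` has been called.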

## 4b: Predicting values

How would you predict the blood sugar levels with your model?
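A minimal sketch of one way to do this, assuming a model fitted on the full dataset as a stand-in for whatever model you trained (the names `model` and `predictions` are illustrative):

```python
from sklearn import datasets
from sklearn.linear_model import LinearRegression

diabetes = datasets.load_diabetes()

# Assumption: a model fitted on all rows, used here only for illustration.
model = LinearRegression().fit(diabetes.data, diabetes.target)

# predict expects a 2D array of shape (n_samples, n_features).
predictions = model.predict(diabetes.data[:5])

print(predictions)          # model output for the first five rows
print(diabetes.target[:5])  # the recorded targets, for comparison
```

Predicting on rows the model was trained on will look better than it should; holding out some rows for evaluation gives a more honest comparison.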