# Task 4 - Implement 10.16 and 15.4 

## Author: Alison Hatfield 
## Date: Oct 2023
## [Project Repository](https://github.com/ajhatfield/datafun-07-ml-predictive)


# Part 1 - Linear Regression.

Load the packages needed.

In [2]:
%matplotlib
import matplotlib
import matplotlib.pyplot as plt

Using matplotlib backend: <object object at 0x107fc7c80>


Create a lambda for the Fahrenheit-to-Celsius formula.
Store each Fahrenheit/Celsius pair as a tuple in temps.

In [3]:
c = lambda f: 5/9*(f-32)

temps = [(f,c(f)) for f in range(0,101,10)]
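
As a quick sanity check on the conversion formula, the sketch below restates the lambda as a named function and confirms the two familiar reference points (32 F and 212 F). The names `fahrenheit_to_celsius` and `temps_check` are introduced here for illustration only and are not part of the notebook.

```python
def fahrenheit_to_celsius(f):
    """Same formula as the lambda c above: C = 5/9 * (F - 32)."""
    return 5 / 9 * (f - 32)

# Same Fahrenheit range as the notebook: 0, 10, ..., 100
temps_check = [(f, fahrenheit_to_celsius(f)) for f in range(0, 101, 10)]

freezing = fahrenheit_to_celsius(32)   # water freezes at 32 F
boiling = fahrenheit_to_celsius(212)   # water boils at 212 F

print(len(temps_check), freezing, boiling)
```

Because the formula uses floating-point division, the boiling point may print with a tiny rounding error rather than as exactly 100.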

Use a pandas DataFrame to plot Celsius vs. Fahrenheit.

In [4]:
import pandas as pd

temps_df = pd.DataFrame(temps, columns=['Fahrenheit', 'Celsius'])

axes = temps_df.plot(x = 'Fahrenheit', y = 'Celsius', style= '.-')

y_label = axes.set_ylabel('Celsius')

## Section 1 - Load: Follow the instructions to load NY City January high temperature from a csv file into a DataFrame

Load and display the New York City data from ave_hi_nyc_jan_1895-2018.csv:

In [5]:
nyc = pd.read_csv('ave_hi_nyc_jan_1895-2018.csv')

## Section 2 - View: Follow the instructions to view head and tail of the file. 

In [6]:
nyc.head()

Unnamed: 0,Date,Value,Anomaly
0,189501,34.2,-3.2
1,189601,34.7,-2.7
2,189701,35.5,-1.9
3,189801,39.6,2.2
4,189901,36.4,-1.0


In [7]:
nyc.tail()

Unnamed: 0,Date,Value,Anomaly
119,201401,35.5,-1.9
120,201501,36.1,-1.3
121,201601,40.8,3.4
122,201701,42.8,5.4
123,201801,38.7,1.3


## Section 3 - Clean: Follow the instructions to clean the data. 

Rename the 'Value' column to 'Temperature'.

In [8]:
nyc.columns = ['Date', 'Temperature', 'Anomaly']

nyc.head()

Unnamed: 0,Date,Temperature,Anomaly
0,189501,34.2,-3.2
1,189601,34.7,-2.7
2,189701,35.5,-1.9
3,189801,39.6,2.2
4,189901,36.4,-1.0


Check the Date column's type.

In [9]:
nyc.Date.dtype

dtype('int64')

Floor-divide by 100 to drop the last two digits (the month). This works because the values are integers.

In [10]:
nyc.Date = nyc.Date.floordiv(100)

nyc.head()

Unnamed: 0,Date,Temperature,Anomaly
0,1895,34.2,-3.2
1,1896,34.7,-2.7
2,1897,35.5,-1.9
3,1898,39.6,2.2
4,1899,36.4,-1.0
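
To make the truncation step concrete, here is a small sketch in plain Python using the five Date values from the earlier `nyc.head()` output. The list names are illustrative assumptions; the notebook itself applies `floordiv` to the whole pandas column.

```python
# Date values copied from the nyc.head() output above (format YYYYMM)
raw_dates = [189501, 189601, 189701, 189801, 189901]

# Floor division by 100 discards the two month digits and keeps the year
years = [d // 100 for d in raw_dates]

# The remainder recovers the month, which is always 1 (January) here
months = [d % 100 for d in raw_dates]

print(years)
print(months)
```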


## Section 4 - Describe: Use describe() to calculate basic descriptive statistics for the dataset. 

Gather some quick statistics on the data using the describe function.

In [20]:
pd.set_option('display.precision', 2)

nyc.Temperature.describe()

count    124.00
mean      37.60
std        4.54
min       26.10
25%       34.58
50%       37.60
75%       40.60
max       47.60
Name: Temperature, dtype: float64

## Section 5 - Calculate Line: Use the SciPy stats module linregress function to calculate slope and intercept for the best fit line through the data.

In [23]:
from scipy import stats


linear_regression = stats.linregress(x=nyc.Date, y=nyc.Temperature)

Return the regression line's slope and intercept.

In [24]:
linear_regression.slope

0.014771361132966163

In [25]:
linear_regression.intercept

8.694993233674289
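
For intuition about where a slope and intercept like these come from, simple linear regression has a closed-form least-squares solution: the slope is S_xy / S_xx and the intercept is mean(y) - slope * mean(x). The sketch below is a minimal pure-Python version, not SciPy's implementation. The function name `least_squares_line` is hypothetical, and it is run on only the five rows shown by `nyc.head()`, so its numbers will not match the full-dataset values above.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # S_xy: how x and y vary together; S_xx: how x varies on its own
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Only the five rows shown by nyc.head(), so this will NOT match the
# full 124-row fit from linregress
years = [1895, 1896, 1897, 1898, 1899]
temperatures = [34.2, 34.7, 35.5, 39.6, 36.4]

slope, intercept = least_squares_line(years, temperatures)
print(slope, intercept)
```

With so few points the fitted slope is dominated by the single warm year (1898), which is one reason the full 124-year fit is far flatter.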

## Section 6 - Predict: Use your results to predict the "average high temp in Jan" for the year 2026. 

In [30]:
linear_regression.slope * 2026 + linear_regression.intercept

38.62177088906374

Approximate the average high temperature for January 1890.

In [27]:
linear_regression.slope * 1890 + linear_regression.intercept

36.612865774980335
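
The two calculations above are the same line equation, y = slope * x + intercept, evaluated at different years. A minimal sketch that wraps it in a reusable function follows; the constants are copied from the `linregress` outputs above, and the name `predict_jan_high` is an assumption made for this example.

```python
# Slope and intercept copied from the linregress outputs above
SLOPE = 0.014771361132966163
INTERCEPT = 8.694993233674289

def predict_jan_high(year):
    """Evaluate the fitted line y = SLOPE * year + INTERCEPT."""
    return SLOPE * year + INTERCEPT

forecast_2026 = predict_jan_high(2026)
estimate_1890 = predict_jan_high(1890)

# Change implied by the fitted line over the 136 years between the two dates
span_change = forecast_2026 - estimate_1890

print(f'{forecast_2026:.2f} {estimate_1890:.2f} {span_change:.2f}')
```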

## Section 7 - Plot: Follow the instructions and use Seaborn to generate a scatter plot with a best fit line. Set the axes and y limit as instructed.

Plotting the average high temp and a regression line using Seaborn's regplot function. 

In [28]:
import seaborn as sns

sns.set_style('whitegrid')

axes = sns.regplot(x=nyc.Date, y=nyc.Temperature)

Scale the y-axis.

In [29]:
axes.set_ylim(10,70)

(10.0, 70.0)

# Part 2 - Machine Learning.

## Section 1 - Make sure data is still loaded and clean - we will be using same csv file from Part 1

In [31]:
nyc.head(3)

Unnamed: 0,Date,Temperature,Anomaly
0,1895,34.2,-3.2
1,1896,34.7,-2.7
2,1897,35.5,-1.9


## Section 2 - Splitting the Data for Training and Testing

In [32]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(nyc.Date.values.reshape(-1,1), nyc.Temperature.values,random_state=11)

In [33]:
X_train.shape

(93, 1)

In [34]:
X_test.shape

(31, 1)
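
The 93/31 shapes follow from the split fraction. The sketch below reproduces the arithmetic, assuming `train_test_split` falls back to its documented default of 25% test data when no size is passed, as in the call above. The variable names are illustrative.

```python
import math

n_samples = 124        # row count from nyc.Temperature.describe() above
test_fraction = 0.25   # assumed default when test_size and train_size are omitted

# 124 * 0.25 is exactly 31, so rounding does not matter for this dataset
n_test = math.ceil(test_fraction * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```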

## Section 3 - Training the Model

In [35]:
from sklearn.linear_model import LinearRegression

linear_regression = LinearRegression()

linear_regression.fit(X=X_train, y=y_train)

In [36]:
linear_regression.coef_

array([0.01939167])

In [37]:
linear_regression.intercept_

-0.30779820252656265

## Section 4 - Testing the Model

In [39]:
predicted = linear_regression.predict(X_test)

expected = y_test

for p, e in zip(predicted[::5], expected[::5]):
    print(f'predicted: {p: .2f}, expected: {e:.2f}')

predicted:  37.86, expected: 31.70
predicted:  38.69, expected: 34.80
predicted:  37.00, expected: 39.40
predicted:  37.25, expected: 45.70
predicted:  38.05, expected: 32.30
predicted:  37.64, expected: 33.80
predicted:  36.94, expected: 39.70
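
One way to summarize how far off these predictions are is the mean absolute error (MAE). The sketch below computes it over only the seven sampled pairs printed above, not the full 31-row test set, so it is a rough indication rather than the model's actual test error. The list names are assumptions for this example.

```python
# The seven (predicted, expected) pairs printed above, i.e. every 5th test row
predicted_sample = [37.86, 38.69, 37.00, 37.25, 38.05, 37.64, 36.94]
expected_sample = [31.70, 34.80, 39.40, 45.70, 32.30, 33.80, 39.70]

# Absolute error for each pair, then their average
abs_errors = [abs(p - e) for p, e in zip(predicted_sample, expected_sample)]
mae = sum(abs_errors) / len(abs_errors)

print(f'mean absolute error: {mae:.2f}')
print(f'largest miss: {max(abs_errors):.2f}')
```

Individual January highs vary by several degrees from year to year, so a straight line through the years cannot be expected to track any single year closely.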


## Section 5 - Predicting Future Temperatures and Estimating Past Temperatures

In [41]:
predict = (lambda x: linear_regression.coef_ * x + linear_regression.intercept_)

predict(2026)

array([38.97973189])

In [42]:
predict(1890)

array([36.34246432])

## Section 6 - Visualizing the Dataset with the Regression Line

In [43]:
axes = sns.scatterplot(data=nyc, x='Date', y= 'Temperature', hue='Temperature', palette='winter', legend= False)


In [44]:
axes.set_ylim(10,70)

(10.0, 70.0)

In [45]:
import numpy as np

x = np.array([min(nyc.Date.values), max(nyc.Date.values)])

y = predict(x)

In [46]:
plt.plot(x,y)

[<matplotlib.lines.Line2D at 0x1222ba0d0>]

# Part 3 - At the end of your notebook, add a final section with some remarks comparing the two methods.

The machine learning approach seemed like an easier way to predict future temperatures or estimate past ones. Other than that, the two methods are different routes to a fairly similar result: for 2026 they predict 38.62 (SciPy) and 38.98 (scikit-learn). The Chapter 10 linear regression version took less code to get the desired result, but the machine learning approach had its positives as well. It seems more flexible, since the same workflow can be used for much more than simple linear regression.
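
To put numbers on the comparison, the sketch below evaluates both fitted lines at the two years used in this notebook. The coefficients are copied from the outputs above; the scikit-learn slope is the rounded value printed by `coef_`, so its results may differ from the notebook's in the last few decimal places. The variable names are assumptions for this example.

```python
# Part 1: SciPy linregress, fit on all 124 rows
scipy_slope, scipy_intercept = 0.014771361132966163, 8.694993233674289

# Part 2: scikit-learn LinearRegression, fit on the 93 training rows
# (slope is the rounded value printed by coef_)
sk_slope, sk_intercept = 0.01939167, -0.30779820252656265

diffs = {}
for year in (1890, 2026):
    scipy_pred = scipy_slope * year + scipy_intercept
    sk_pred = sk_slope * year + sk_intercept
    diffs[year] = sk_pred - scipy_pred
    print(f'{year}: SciPy {scipy_pred:.2f}, scikit-learn {sk_pred:.2f}, '
          f'difference {diffs[year]:+.2f}')
```

The two lines disagree by well under half a degree at both years. Part of the gap comes from the scikit-learn model seeing only the training rows, while the SciPy fit used every row.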