# Regression Analysis

This notebook demonstrates basic regression analysis on [weather in Szeged, Hungary](https://www.kaggle.com/budincsevity/szeged-weather).

Regression analysis is a statistical method used to understand the relationship between a dependent variable and one or more independent variables. In this notebook, we will use regression analysis to understand the relationship between temperature and humidity. First, download the dataset from the link above and copy it to the same directory as this notebook.

We will start by loading the data using `pandas` and printing the first 5 rows.

In [2]:
import pandas as pd

# Read the data from the csv file
df = pd.read_csv('weatherHistory.csv')

# Print the first 5 rows of the dataframe.
df.head()

| | Formatted Date | Summary | Precip Type | Temperature (C) | Apparent Temperature (C) | Humidity | Wind Speed (km/h) | Wind Bearing (degrees) | Visibility (km) | Loud Cover | Pressure (millibars) | Daily Summary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2006-04-01 00:00:00.000 +0200 | Partly Cloudy | rain | 9.472222 | 7.388889 | 0.89 | 14.1197 | 251.0 | 15.8263 | 0.0 | 1015.13 | Partly cloudy throughout the day. |
| 1 | 2006-04-01 01:00:00.000 +0200 | Partly Cloudy | rain | 9.355556 | 7.227778 | 0.86 | 14.2646 | 259.0 | 15.8263 | 0.0 | 1015.63 | Partly cloudy throughout the day. |
| 2 | 2006-04-01 02:00:00.000 +0200 | Mostly Cloudy | rain | 9.377778 | 9.377778 | 0.89 | 3.9284 | 204.0 | 14.9569 | 0.0 | 1015.94 | Partly cloudy throughout the day. |
| 3 | 2006-04-01 03:00:00.000 +0200 | Partly Cloudy | rain | 8.288889 | 5.944444 | 0.83 | 14.1036 | 269.0 | 15.8263 | 0.0 | 1016.41 | Partly cloudy throughout the day. |
| 4 | 2006-04-01 04:00:00.000 +0200 | Mostly Cloudy | rain | 8.755556 | 6.977778 | 0.83 | 11.0446 | 259.0 | 15.8263 | 0.0 | 1016.51 | Partly cloudy throughout the day. |


Let's use `scikit-learn` to perform regression analysis. We will use the `LinearRegression` class, which fits an ordinary least squares model: it finds the linear relationship between the independent variables and the dependent variable that minimizes the sum of squared residuals. The `fit` method trains the model, and the `predict` method makes predictions. For our first model, we will predict temperature from humidity.
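To see what `fit` and `predict` do in isolation, here is a minimal sketch on a tiny synthetic dataset. The values and the names `X_demo`, `y_demo`, and `demo_model` are made up for illustration and are not part of the weather data. Because the data follows an exact line, the fitted slope and intercept should recover it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data following y = 3x + 2 (illustrative values only).
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])  # shape (n_samples, n_features)
y_demo = 3.0 * X_demo.ravel() + 2.0

# fit() estimates the slope and intercept from the training data.
demo_model = LinearRegression()
demo_model.fit(X_demo, y_demo)

slope = demo_model.coef_[0]
intercept = demo_model.intercept_

# predict() applies the fitted line to new inputs.
prediction = demo_model.predict(np.array([[4.0]]))[0]

print(slope, intercept, prediction)
```

Note that `X_demo` is two-dimensional even with a single feature, which is why the notebook selects features with a list (`df[features]`) rather than a single column name.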

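One common way to measure prediction error is the mean squared error, available as the `mean_squared_error` function in `sklearn.metrics`. The mean squared error is the average of the squared differences between the actual and predicted values. When comparing models on the same target, a lower mean squared error indicates a better model. The cells in this notebook report the R-squared score through the model's `score` method instead.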
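To make the definition concrete, the following sketch computes the mean squared error by hand and with `mean_squared_error`. The arrays `y_true_demo` and `y_pred_demo` are made-up values for illustration, not results from the weather model.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up actual and predicted values (illustrative only).
y_true_demo = np.array([10.0, 12.0, 15.0, 9.0])
y_pred_demo = np.array([11.0, 12.0, 13.0, 10.0])

# Average of the squared differences, written out explicitly.
mse_manual = np.mean((y_true_demo - y_pred_demo) ** 2)

# The same quantity from scikit-learn.
mse_sklearn = mean_squared_error(y_true_demo, y_pred_demo)

print(mse_manual, mse_sklearn)  # both are 1.5
```

In the notebook itself, the equivalent call would be something like `mean_squared_error(y_test, lr.predict(X_test))` once the model has been fitted.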

Additionally, we will want to validate our model on data that was not used for training it. For this, we can use the `train_test_split` function to split the data into training and testing sets. We will use 80% of the data for training and 20% for testing.

In [7]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression


# Create a list of features
features = ['Humidity']

# Create X and y.
X = df[features]
y = df['Temperature (C)']

# Split the data into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.2)

# Instantiate and fit our model.
lr = LinearRegression()
lr.fit(X_train, y_train)

# Print out the R^2 for the model against the test set.
print(lr.score(X_test, y_test))


0.39578649233601326


R-squared measures how well the model fits the data: it is the proportion of the variance in the dependent variable that is predictable from the independent variables. A value of 1 means perfect predictions, 0 means the model does no better than always predicting the mean, and higher values indicate a better fit. Using humidity alone, the model explains about 40% of the variance in temperature on the test set. Let's repeat this process for each feature individually.
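As a rough illustration of what the `score` method reports, the sketch below computes R-squared by hand as one minus the ratio of the residual sum of squares to the total sum of squares, and compares it with `r2_score` from `sklearn.metrics`. The arrays are made-up values, not outputs of the weather model.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up actual and predicted values (illustrative only).
actual = np.array([10.0, 12.0, 15.0, 9.0])
predicted = np.array([11.0, 12.0, 13.0, 10.0])

# Residual sum of squares: the variation the model fails to explain.
ss_res = np.sum((actual - predicted) ** 2)

# Total sum of squares: the variation around the mean of the actual values.
ss_tot = np.sum((actual - actual.mean()) ** 2)

r2_manual = 1.0 - ss_res / ss_tot
r2_sklearn = r2_score(actual, predicted)

# A baseline that always predicts the mean scores exactly 0.
r2_baseline = r2_score(actual, np.full_like(actual, actual.mean()))

print(r2_manual, r2_sklearn, r2_baseline)
```

Here `ss_res` is 6 and `ss_tot` is 21, so both `r2_manual` and `r2_sklearn` come out to 1 - 6/21, roughly 0.714.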

In [26]:
# Create a list of the features to evaluate
features = ['Humidity', 'Wind Speed (km/h)', 'Wind Bearing (degrees)', 'Visibility (km)', 'Pressure (millibars)']

# Create X and y.
X = df[features]
y = df['Temperature (C)']

# Split the data into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.2)

# Loop through and print each feature name, and the R^2 score for the model using only that feature.
for feature in features:
    print('** {} **'.format(feature))
    X_train_feat = X_train[[feature]]
    X_test_feat = X_test[[feature]]
    lr = LinearRegression()
    lr.fit(X_train_feat, y_train)
    print('Train R^2 score: {}'.format(lr.score(X_train_feat, y_train)))
    print('Test R^2 score:  {}\n'.format(lr.score(X_test_feat, y_test)))


** Humidity **
Train R^2 score: 0.4007432667666295
Test R^2 score:  0.39578649233601326

** Wind Speed (km/h) **
Train R^2 score: 6.883483791153555e-05
Test R^2 score:  8.543479810196875e-05

** Wind Bearing (degrees) **
Train R^2 score: 0.0007355856946651418
Test R^2 score:  0.0014678220112629425

** Visibility (km) **
Train R^2 score: 0.15373604790945894
Test R^2 score:  0.15663814028015022

** Pressure (millibars) **
Train R^2 score: 2.6252850073849032e-05
Test R^2 score:  5.060108274390629e-06



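Based on the test R-squared scores, humidity is the best single linear predictor of temperature among the features tested (about 0.40), followed by visibility (about 0.16). Wind speed, wind bearing, and pressure each explain less than 1% of the variance on their own.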
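Reading the best feature off a printed list works for five features, but collecting the scores into a sorted `pandas` Series makes the comparison easier to scan. The sketch below shows one way to do that on synthetic data. The column names `strong`, `weak`, and `unrelated`, the coefficients, and the random seed are all assumptions chosen for illustration, and the resulting scores say nothing about the weather dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic features: one strongly related to the target, one weakly, one not at all.
rng = np.random.default_rng(0)
n_rows = 500
demo_df = pd.DataFrame({
    "strong": rng.normal(size=n_rows),
    "weak": rng.normal(size=n_rows),
    "unrelated": rng.normal(size=n_rows),
})
target = 5.0 * demo_df["strong"] + 1.0 * demo_df["weak"] + rng.normal(size=n_rows)

X_tr, X_te, y_tr, y_te = train_test_split(demo_df, target, random_state=42, test_size=0.2)

# Fit one single-feature model per column and record its test R^2.
test_scores = {}
for name in demo_df.columns:
    model = LinearRegression().fit(X_tr[[name]], y_tr)
    test_scores[name] = model.score(X_te[[name]], y_te)

# Sort so the strongest single predictor comes first.
ranking = pd.Series(test_scores).sort_values(ascending=False)
print(ranking)
```

The same pattern could be applied to the loop above by storing each test score in a dictionary keyed by feature name instead of printing it.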