# Diving into Data Science with Regression

Every data science project benefits from following a clear structure. In this lesson, we introduce one such structure and use one-dimensional regression as a worked example. As far as possible, the same structure should also be followed in practice.

## PPDAC: Problem, Plan, Data, Analysis and Conclusion

![from creativemaths.net/blog](https://learnandteachstatistics.files.wordpress.com/2015/07/ppdac_complete_background.png)

- **Problem**: Understand and define the problem. How do we go about answering it?
- **Plan**: What should be measured, and how? What is the study design? How is the data recorded and collected?
- **Data**: Collection, management, cleaning.
- **Analysis**: Sort the data; construct tables and graphs; look for patterns; generate hypotheses.
- **Conclusion**: Communication, conclusions, interpretation, new ideas (repeat the cycle if required).

For more details, see Wolff, A. et al., "Creating an Understanding of Data Literacy for a Data-driven Society", The Journal of Community Informatics, Vol. 12, Issue 3, 2016, doi:[10.15353/joci.v12i3.3275](https://www.ci-journal.net/index.php/JoCI/article/view/3275)

## Problem statement

![from commons.wikimedia](https://upload.wikimedia.org/wikipedia/commons/0/0d/Blue_iceberg_south_polar_circle.jpg)

- I would like to know whether the amount of ice on the polar caps is really decreasing from year to year.
- For this, I'd like to know how much ice is found during the coldest period of the polar year (March).
- It would also be nice to know by how much the polar ice shrinks per year.

## Plan

- I need data from the polar region that reports the monthly ice area over recent years.
- It would be nice if this data went back to the 19th century so that I can be sure of the trend.
- The dataset should be small enough to be stored locally so that I can always return to it.
- I am mostly interested in the ice surface area as a proxy metric/observable for the amount of ice.

## Data

- Updated June 2020
- Can be obtained from [here](http://sustainabilitymath.org/excel/ArcticIceDataMonth-R.csv)
- Source: National Snow and Ice Data Center (http://nsidc.org); Sea Ice Index page: http://nsidc.org/data/g02135.html; data located at ftp://sidads.colorado.edu/DATASETS/NOAA/G02135/
 
> Important Note: The "extent" column includes the area near the pole not
> imaged by the sensor. It is assumed to be entirely ice covered with at
> least 15% concentration. However, the "area" column excludes the area not
> imaged by the sensor. This area is 1.19 million square kilometers for SMMR
> (from the beginning of the series through June 1987) and 0.31 million
> square kilometers for SSM/I (from July 1987 to present). Therefore, there
> is a discontinuity in the "area" data values in this file at the June/July
> 1987 boundary.
> 
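
To make the note above concrete, here is a minimal sketch of how the unimaged region could be looked up for a given month. The helper name `unimaged_pole_area` is hypothetical and not part of the dataset or any library; the two values and the June/July 1987 boundary are taken from the note.

```python
def unimaged_pole_area(year: int, month: int) -> float:
    """Return the area (million km^2) near the pole not imaged by the sensor.

    Hypothetical helper for illustration; values come from the note above.
    """
    # SMMR: beginning of the series through June 1987
    # SSM/I: July 1987 to present
    if (year, month) <= (1987, 6):
        return 1.19
    return 0.31


# Change in the excluded region at the June/July 1987 boundary
jump = unimaged_pole_area(1987, 6) - unimaged_pole_area(1987, 7)
print(f"excluded area shrinks by {jump:.2f} million km^2")
```

A lookup like this is only needed when working with the "area" column. The "extent" column includes the region near the pole, so it does not carry this particular discontinuity.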



## Import data

- Use `pandas` to import the `.csv` file containing your data.
- Check the data types and the consistency of the data.

In [1]:
import pandas as pd
print("pandas version:", pd.__version__)

pandas version: 1.1.3


In [None]:
df = pd.read_csv("http://sustainabilitymath.org/excel/ArcticIceDataMonth-R.csv")
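The check of data types and consistency could look roughly like the sketch below. It runs on a tiny in-memory stand-in rather than the downloaded file: the column names `year`, `month`, `extent` and `area` and all values are assumptions made up for illustration and may differ from the real CSV.

```python
import io

import pandas as pd

# Hypothetical stand-in for the downloaded file: column names and values
# are made up for illustration and may differ from the real CSV.
csv_text = """year,month,extent,area
2000,3,10.0,8.0
2001,3,9.8,7.9
2002,3,,7.8
"""
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.dtypes)        # one dtype per column
print(sample.isna().sum())  # number of missing values per column

# Simple consistency checks: valid months and no duplicated (year, month) rows
months_ok = sample["month"].between(1, 12).all()
no_duplicates = not sample.duplicated(subset=["year", "month"]).any()
print("months valid:", months_ok, "| no duplicates:", no_duplicates)
```

The same calls can be applied to `df` once the actual column names are known, for instance after inspecting `df.head()`.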

In [4]:
import numpy as np
print(f"numpy version: {np.__version__}")

numpy version: 1.19.1


## Analysis

Let's look at the data first.
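One way to take this first look is to keep only the March rows and plot them against the year. The sketch below uses a small made-up table; the column names `year`, `month` and `extent` are assumptions and the numbers are not real measurements.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up illustrative values; "year", "month" and "extent" are assumed column names.
sample = pd.DataFrame({
    "year":   [2000, 2000, 2001, 2001, 2002, 2002],
    "month":  [3, 9, 3, 9, 3, 9],
    "extent": [10.0, 5.0, 9.8, 4.9, 9.6, 4.8],
})

# Keep only the March rows, the month the problem statement focuses on
march = sample[sample["month"] == 3]

fig, ax = plt.subplots()
ax.plot(march["year"], march["extent"], "o-")
ax.set_xlabel("year")
ax.set_ylabel("March ice extent")
ax.set_title("March ice extent per year (illustrative data)")
```

In a notebook the figure is displayed below the cell; in a plain script, a call to `plt.show()` would be needed.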

In [5]:
import matplotlib.pyplot as plt
plt.style.use('dark_background')


Perform the linear regression

In [6]:
import sklearn
print("Scikit-learn version:", sklearn.__version__)

Scikit-learn version: 0.23.2


In [7]:
from sklearn import linear_model
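A one-dimensional fit with `linear_model.LinearRegression` might be sketched as follows. The data is made up so that it lies exactly on a line (starting at 10.0 in the year 2000 and falling by 0.05 per year); with the real dataset, the March values would take the place of `extent`.

```python
import numpy as np
from sklearn import linear_model

# Made-up illustrative data lying exactly on a line:
# 10.0 in the year 2000, falling by 0.05 per year
years = np.arange(2000, 2010)
extent = 10.0 - 0.05 * (years - 2000)

# scikit-learn expects a 2D feature array of shape (n_samples, n_features)
X = years.reshape(-1, 1)

model = linear_model.LinearRegression()
model.fit(X, extent)

slope = model.coef_[0]        # change in extent per year
intercept = model.intercept_  # value of the fitted line at year 0
print(f"slope: {slope:.3f} per year")
```

The fitted slope is the quantity the problem statement asks about: the change in ice per year. A negative slope indicates a decreasing trend.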

Assess the model with the mean squared error (MSE)

In [8]:
from sklearn.metrics import mean_squared_error
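The mean squared error is the average of the squared differences between observed and predicted values. As a small sketch with made-up numbers, `mean_squared_error` can be compared against the same quantity computed by hand with NumPy:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up observed and predicted values, for illustration only
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

mse_sklearn = mean_squared_error(y_true, y_pred)
mse_manual = np.mean((y_true - y_pred) ** 2)  # same quantity computed by hand

print(mse_sklearn, mse_manual)
```

For a fitted regression model, `y_pred` would come from `model.predict(X)` and `y_true` from the observed values.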

## Conclusion

What do the results tell you?