# Data Science

What the heck is data science? I consider it a bit of a buzzword: pervasive, but still vaguely defined. In the context of our bootcamp, 'data science' means two modules: `pandas`, for loading datasets into a convenient tabular format called a dataframe, and `sklearn` (scikit-learn), for regression.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# I like to import as pd
import pandas as pd

In [None]:
# Load an example dataset as a dataframe
df = pd.read_csv('./data/example_dataframe.csv',index_col=0)

In [None]:
# Display the column labels of the dataframe
df.columns

In [None]:
# Display the index array (the row labels, such as 'CHN')
df.index

In [None]:
# Index the dataframe like a dictionary
df.loc['CHN']

In [None]:
df.loc['CHN']['POP']

In [None]:
df['POP']['CHN']
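
The two chained lookups above return the same value, and `.loc` also accepts the row and column labels together in one call. The sketch below illustrates this on a small made-up dataframe; the `toy` name and its values are hypothetical and only borrow the `POP`, `AREA`, and `CHN` labels from the cells above.

```python
import pandas as pd

# Hypothetical toy data; the real values in example_dataframe.csv will differ
toy = pd.DataFrame(
    {"POP": [1400.0, 330.0, 210.0], "AREA": [9600.0, 9800.0, 8500.0]},
    index=["CHN", "USA", "BRA"],
)

# Chained indexing: select the row first, then one entry of that row
chained = toy.loc["CHN"]["POP"]

# Single-step label indexing: row label and column label in one .loc call
single = toy.loc["CHN", "POP"]

# .at does the same lookup for a single scalar value
scalar = toy.at["CHN", "POP"]

print(chained, single, scalar)
```

The single-step form is usually preferable when assigning values, since chained indexing may operate on a temporary copy.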

Now we want to use linear regression to build a prediction model from these data. This type of modeling is commonly referred to as 'machine learning'; linear regression sits at the simplest end of the range of models used.

In [None]:
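from sklearn.linear_model import LinearRegression

# sklearn expects the features as a 2D array of shape (n_samples, n_features)
pop = df['POP'].values
area = df['AREA'].values.reshape(-1,1)
model = LinearRegression().fit(area,pop)

# Evaluate the fitted line on an evenly spaced grid of areas
areafit = np.linspace(0, 20000, 1000).reshape(-1,1)
popfit = model.predict(areafit)

# Plot the data (black dots) and the fitted line (red)
plt.figure()
plt.plot(area,pop,'k.')
plt.plot(areafit,popfit,'r')
plt.xlabel('AREA')
plt.ylabel('POP')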

To be honest, this is a terrible model for this particular dataset, but you could play around with the different columns to determine which variables are most correlated and perhaps develop a more sophisticated model here.
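
One way to start exploring which variables are most correlated is the dataframe's `corr()` method, which returns the pairwise Pearson correlation between numeric columns. The sketch below uses synthetic data so that it runs on its own; the `toy` dataframe and its `NOISE` column are made up for illustration, and `POP` is tied to `AREA` by construction.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (an assumption, not the bootcamp dataset)
rng = np.random.default_rng(0)
n = 50
area = rng.uniform(10, 1000, n)
toy = pd.DataFrame({
    "AREA": area,
    "POP": 3.0 * area + rng.normal(0, 50, n),  # strongly tied to AREA by construction
    "NOISE": rng.normal(0, 1, n),              # unrelated column
})

# Pairwise Pearson correlation between all numeric columns
corr = toy.corr()

# Rank the other columns by the strength of their correlation with POP
ranked = corr["POP"].drop("POP").abs().sort_values(ascending=False)
print(ranked)
```

On the real dataframe, the same two lines (`df.corr()` and the ranking) would apply to whichever numeric columns it contains.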

## Questions

1) The dataset that we have been working with is the global land and ocean temperature anomalies from [NOAA](https://www.ncdc.noaa.gov/cag/global/time-series/globe/land_ocean/all/1/1880-2022). Load the version with a header at ./data/dataset_with_header.csv, this time as a pandas dataframe, and look at the header of the file. As always, this work should be done in DatasetAnswers.ipynb.

    a) Extrapolate these data forward to the year 2100 using linear regression. What is your modeled temperature? Is this within the 1.5$^\circ$C target outlined in the Paris Agreement?
    
    b) Is linear regression appropriate here? Think about using some of the tools that you learned in prior notebooks to preprocess the raw data before regressing. For example, sort by month so that you can pull out the seasonality of the dataset, or apply a median filter to quiet some of the noise (or both?).
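
As a starting point for the preprocessing idea in part b), the sketch below applies a rolling median and then fits a line to the smoothed values. It runs on a synthetic yearly series, not the NOAA file: the trend, noise level, and window length are all assumptions chosen for illustration, and the real dataset's column names and monthly sampling will need their own handling.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic series (an assumption): a 0.01-per-year trend plus random noise
rng = np.random.default_rng(1)
years = np.arange(1900, 2021)
values = 0.01 * (years - 1900) + rng.normal(0, 0.1, years.size)
series = pd.Series(values, index=years)

# Centered rolling median over 11 samples; the edges become NaN and are dropped
smoothed = series.rolling(window=11, center=True).median().dropna()

# Fit a line to the smoothed values; sklearn needs a 2D feature array
X = smoothed.index.values.reshape(-1, 1)
model = LinearRegression().fit(X, smoothed.values)

# Extrapolate the fitted line to a single future year
pred = model.predict(np.array([[2100]]))[0]
print(model.coef_[0], pred)
```

Comparing the fit on the raw and smoothed series, and on different window lengths, is one way to judge how sensitive the extrapolation is to the preprocessing.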