# Elementary Python Colab Tutorial

## The datasets and a copy of this Colab notebook are in my GitHub repository:
https://github.com/aghababa/Elementary-Python-Tutorial

One way to work with this Colab notebook is to clone that repository and copy its files into your Google Drive.

In [1]:
import numpy as np 
from scipy import linalg as LA
import pandas as pd
import random 
import matplotlib as mpl
import matplotlib.pyplot as plt 

# Importing files from Google Drive

---

# Mounting Google Drive
### We can access files in our Google Drive by mounting it, i.e., by setting up the Google Drive account as a virtual drive. We can then access its files from Colab as if they were on a local drive of our computer.

### To connect Google Drive with Colab, we can execute the following two lines of code in Colab:

In [None]:
from google.colab import drive
drive.mount("/content/gdrive")

Mounted at /content/gdrive


# The dataset we are going to use is a simple regression dataset from Kaggle, which can be downloaded from this link:
https://www.kaggle.com/luddarell/101-simple-linear-regressioncsv

---

# Importing data from your Google Drive, assuming the data has been copied there

## (I have pasted the Data.csv and Data1.csv files both into my Google Drive and into a folder named Python Tutorial in my Google Drive)

## Viewing data as a dataframe (needs "import pandas as pd")

#### The dataset is from Kaggle: https://www.kaggle.com/mayanksrivastava/predict-housing-prices-simple-linear-regression

However, we will not use this data for the regression. We use it only to introduce the pandas dataframe, which is convenient for viewing tabular data. I am using only 100 rows and some of the columns.

In [None]:
# if the data is copied into Google Drive (append [:5] to show only the first 5 rows)
pd.read_csv('/content/gdrive/My Drive/Data1.csv')

In [None]:
# if the data is in a folder in your Google Drive, use the following
pd.read_csv('/content/gdrive/My Drive/Python Tutorial/Data.csv') 

# Importing data from your computer
#### To import data, execute the following two lines of code in Colab and then choose your file by clicking the "Choose Files" button.

In [None]:
from google.colab import files
uploaded = files.upload()

In [None]:
pd.read_csv("Data1.csv")[:5]
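
The slice `[:5]` keeps the first five rows of the dataframe, which is what `head()` returns by default. A minimal sketch, assuming a small in-memory CSV with made-up column names and values in place of Data1.csv:

```python
import io

import pandas as pd

# Made-up CSV text standing in for Data1.csv (column names are hypothetical).
csv_text = (
  "area,price\n"
  "1000,200\n1500,310\n1800,355\n2100,420\n2500,510\n3000,600\n"
)

df = pd.read_csv(io.StringIO(csv_text))

first_rows = df[:5]      # slicing, as in the cell above
same_rows = df.head()    # head() defaults to the first 5 rows

print(first_rows.equals(same_rows))
```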

# Downloading Data from Colab to a Local Drive on Your Computer

We can download data to a local directory by executing the following lines of code. Here we assume that the dataset is in CSV format.

In [None]:
from google.colab import files

# if the data is in Google Drive
files.download('/content/gdrive/My Drive/Data.csv') 

In [None]:
# if the data is in a folder in Google Drive, use the following
files.download('/content/gdrive/My Drive/Python Tutorial/Data1.csv')

# Regression

### We are going to do a simple regression task in order to become familiar with some basic operations and functions needed in this course.

#### The first function we need reads a file, such as a CSV file.





In [None]:
def read_file(file_name):
  data = []
  with open(file_name, "r") as f:
    for line in f:
      item = line.strip().split(",")
      data.append(np.array(item))
  return data
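
To see what `read_file` returns, here is a minimal sketch that runs it on a tiny temporary CSV file. The file contents are made up for illustration; the function body is the same as in the cell above. Note that every entry comes back as a string, including the header row, which is why the header is dropped and the values are converted to floats later on.

```python
import os
import tempfile

import numpy as np


def read_file(file_name):
  # Same logic as the cell above: one array of strings per line.
  data = []
  with open(file_name, "r") as f:
    for line in f:
      item = line.strip().split(",")
      data.append(np.array(item))
  return data


# Hypothetical two-column CSV with a header row (values are made up).
csv_text = "x,y\n1.0,2.0\n3.0,4.5\n"

with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
  tmp.write(csv_text)
  tmp_path = tmp.name

rows = read_file(tmp_path)
os.remove(tmp_path)

header, body = rows[0], rows[1:]   # [1:] drops the header row
print(header)                      # column names, as strings
print(body[0])                     # first data row, still strings
```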

## Reading x and y values

In [None]:
data = read_file('/content/gdrive/My Drive/Data.csv')[1:]

In [None]:
len(data)

84

In [None]:
random.shuffle(data)
data = np.array(data)

In [None]:
data
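
Because `random.shuffle` reorders the rows differently on each run, the train/test split below also changes from run to run. If a repeatable order is wanted, one option is to fix the seed first. A minimal sketch, assuming a list of integers as a stand-in for the data rows and an arbitrary seed value of 0:

```python
import random

rows = list(range(10))   # stand-in for the list of data rows

random.seed(0)           # 0 is an arbitrary choice of seed
first = rows.copy()
random.shuffle(first)

random.seed(0)           # resetting the seed replays the same shuffle
second = rows.copy()
random.shuffle(second)

print(first == second)   # same seed, same order
```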

In [None]:
x_values = data[:,0]
y_values = data[:,1]

In [None]:
x = np.zeros(len(x_values))
for i in range(len(x_values)):
  x[i] = float(x_values[i])

In [None]:
x

In [None]:
y = np.zeros(len(y_values))
for i in range(len(y_values)):
  y[i] = float(y_values[i])

In [None]:
y
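
The two conversion loops above can also be written as a single vectorized call with NumPy's `astype`. A minimal sketch, assuming a small made-up array of strings shaped like `data`:

```python
import numpy as np

# Made-up string data shaped like the array read from the CSV (two columns).
data = np.array([["1.5", "10"], ["2.0", "12.5"], ["3.25", "15"]])

x = data[:, 0].astype(float)   # first column as floats
y = data[:, 1].astype(float)   # second column as floats

print(x)
print(y)
```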

## Train/Test split of data (75/25%)

In [None]:
# 63 is 75% of the 84 rows; the remaining 21 rows form the test set
data_train = x[:63]
data_test = x[63:]

y_train = y[:63]
y_test = y[63:]
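
The split index 63 is hard-coded for a dataset of 84 rows. A minimal sketch of computing it from the training fraction instead, assuming synthetic stand-in arrays for `x` and `y` (the name `train_fraction` is introduced here for illustration):

```python
import numpy as np

x = np.arange(84, dtype=float)   # stand-in for the 84 x-values
y = 2.0 * x + 1.0                # stand-in for the y-values

train_fraction = 0.75
n_train = int(train_fraction * len(x))   # number of training rows

data_train, data_test = x[:n_train], x[n_train:]
y_train, y_test = y[:n_train], y[n_train:]

print(len(data_train), len(data_test))
```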

## Linear Regression
### Obtaining model parameters, i.e., $a$ and $b$ for the linear model $\ell(x) = a x + b$:



*   $x_{ave} = \overline{x} = \frac{1}{n} \sum_{i=1}^n x_i$, $\quad y_{ave} = \overline{y} = \frac{1}{n} \sum_{i=1}^n y_i$
*   $\overline{X} = (x_1-\overline{x}, \ldots, x_n-\overline{x})$, $\quad \overline{Y} = (y_1-\overline{y}, \ldots, y_n-\overline{y})$

*   $a = \langle \overline{X}, \overline{Y}\rangle/\|\overline{X}\|^2$, $\quad b = \overline{y} - a \cdot \overline{x}$.
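
As a sanity check, the formulas above can be compared with NumPy's built-in least-squares line fit, `np.polyfit` with degree 1, which returns the slope first and the intercept second. A minimal sketch, assuming synthetic data generated around the line $3x + 2$ (the data and the seed are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 3.0 * x + 2.0 + rng.normal(0, 0.5, size=50)   # noisy line 3x + 2

# Closed-form parameters from the formulas above.
X_bar = x - x.mean()
Y_bar = y - y.mean()
a = np.dot(X_bar, Y_bar) / np.dot(X_bar, X_bar)   # <X_bar, Y_bar> / ||X_bar||^2
b = y.mean() - a * x.mean()

# Reference fit: polyfit returns coefficients from highest degree down.
a_ref, b_ref = np.polyfit(x, y, deg=1)

print(np.allclose([a, b], [a_ref, b_ref]))
```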

In [None]:
n = len(data_train)

x_ave = sum(data_train)/n
y_ave = sum(y_train)/n

X_bar = data_train - x_ave
Y_bar = y_train - y_ave

In [None]:
a = np.dot(X_bar, Y_bar)/LA.norm(X_bar)**2
b = y_ave - a * x_ave

In [None]:
a,b

## Calculating the residuals $r_i = y_i - \hat{y}_i$, where $\hat{y}_i = \ell(x_i) = ax_i+b$

In [None]:
# predictions on the test set, computed one element at a time
y_hat = np.zeros(len(data_test))

for i in range(len(data_test)):
  y_hat[i] = a * data_test[i] + b

In [None]:
# the same predictions in one vectorized expression
y_hat = a * data_test + b

In [None]:
r = y_test - y_hat #or y_hat - y_test

In [None]:
r
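
The element-wise loop and the vectorized expression for `y_hat` give the same predictions, and therefore the same residuals. A minimal sketch, assuming made-up values for `a`, `b`, `data_test`, and `y_test`:

```python
import numpy as np

a, b = 2.0, 1.0                          # made-up model parameters
data_test = np.array([0.0, 1.0, 2.0])    # made-up test inputs
y_test = np.array([1.5, 2.5, 5.5])       # made-up test targets

# Loop version, one prediction at a time.
y_hat_loop = np.zeros(len(data_test))
for i in range(len(data_test)):
  y_hat_loop[i] = a * data_test[i] + b

# Vectorized version, all predictions at once.
y_hat = a * data_test + b

r = y_test - y_hat                       # residuals r_i = y_i - y_hat_i

print(np.array_equal(y_hat_loop, y_hat))
print(r)
```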

# Plotting data and the Model

In [None]:
plt.scatter(data_train, y_train);
plt.scatter(data_test, y_test, color = 'red');
plt.show()

$\ell(x) = ax + b$

In [None]:
def lin_reg(x):
  return a * x + b

In [None]:
plt.scatter(data_train, y_train);
plt.scatter(data_test, y_test, color = 'red');

xlist = np.linspace(1600, 2100, 200)
plt.plot(xlist, lin_reg(xlist), 'g', linewidth=2)

plt.show()

# Recap

## We covered the following:

*   How to set up the Google Drive account as a virtual drive.
*   How to upload data/files from our local drive to Colab.
*   How to read files from Google Drive in Colab.
*   How to view data using pandas.
*   How to download files from Colab.
*   How to do some elementary operations in Python using NumPy.
*   How to plot a function in Python.

