# DATA SCIENCE SESSIONS VOL. 3
### A Foundational Python Data Science Course
## Session 00: Introduction and Prerequisites

[&larr; Back to course webpage](http://datakolektiv.com/app_direct/introdsnontech/)

Feedback should be sent to goran.milovanovic@datakolektiv.com.

These notebooks accompany the DATA SCIENCE SESSIONS VOL. 3 :: A Foundational Python Data Science Course.

### Goran S. Milovanović, PhD
<b>DataKolektiv, Chief Scientist & Owner</b>

![](../img/DK_Logo_100.png)

***

### 0. What do we want to do today?

Our goal in Session 00 is to prepare ourselves technically for what follows. We will:

- create a course root directory `DSS_Vol00_PythonDS_2023` somewhere on your machine;
- create a `venv` virtual environment named `dss00python2023` in that directory;
- activate the `dss00python2023` environment;
- perform necessary installations there;
- create a `dss03python2023` directory, where things will happen, in `DSS_Vol00_PythonDS_2023`;
- clone the [dss03python2023](https://github.com/datakolektiv/dss03python2023) repo from [DataKolektiv](https://github.com/datakolektiv) into `dss03python2023`;
- and we're ready to go!

Let's get on with the procedures, step by step:

**Step 1.** Create a course root directory and enter it

**shell:**

`mkdir DSS_Vol00_PythonDS_2023`

`cd DSS_Vol00_PythonDS_2023`

**Step 2.** Create a `venv` virtual environment named `dss00python2023` in that course root directory

**shell:**

`python3 -m venv dss00python2023`

**Step 3.** Activate the `dss00python2023` environment

**shell:**

`source dss00python2023/bin/activate`
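To confirm that the activation worked, one option is to ask Python itself. The sketch below is an illustration rather than part of the course setup: the helper name `in_virtual_environment` is made up here, and the check relies on `sys.prefix` differing from `sys.base_prefix` inside a `venv` environment.

```python
import sys


def in_virtual_environment() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("Interpreter:", sys.executable)
print("Virtual environment active:", in_virtual_environment())
```

Run from a shell where `dss00python2023` is active, the interpreter path should point inside the `dss00python2023` directory.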

**Step 4.** Perform necessary installations

**shell:**

`pip install numpy`

`pip install pandas`

`pip install scipy`

`pip install matplotlib`

`pip install seaborn`

`pip install plotly`

`pip install statsmodels`

`pip install -U scikit-learn`

`pip install ipykernel`
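A quick way to verify the installations is to list the installed version of each package. This is a minimal sketch using the standard library's `importlib.metadata`; the `PACKAGES` list and the `installed_versions` helper are names introduced here for illustration.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names exactly as passed to `pip install` in Step 4.
PACKAGES = [
    "numpy", "pandas", "scipy", "matplotlib", "seaborn",
    "plotly", "statsmodels", "scikit-learn", "ipykernel",
]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The package is not installed in the active environment.
            found[name] = None
    return found


for name, ver in installed_versions(PACKAGES).items():
    print(f"{name:<14} {ver if ver is not None else 'NOT INSTALLED'}")
```

Any line reporting `NOT INSTALLED` suggests that the corresponding `pip install` should be repeated with the environment activated.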

**What packages have we installed?**

- [NumPy](https://numpy.org/) is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays. Essentially, NumPy makes Python a vector programming language, similar to R and other languages that are tailor-made for efficient computation in Statistical Machine Learning.

- [Pandas](https://pandas.pydata.org/) is built on top of NumPy and used for data manipulation, management, and analysis. Most importantly, Pandas adds the `DataFrame` class to Python, providing the essential means to load, manage, persist, and manipulate tabular data in Python.

- [SciPy](https://scipy.org/) SciPy provides algorithms for optimization, integration, interpolation, eigenvalue problems, algebraic equations, differential equations, statistics and many other classes of problems. Its [Statistical functions](https://docs.scipy.org/doc/scipy/reference/stats.html) module encompasses essential functions used in probability and mathematical statistics.

- [Matplotlib](https://matplotlib.org/) A standard visualization package.

- [Seaborn](https://seaborn.pydata.org/) An advanced visualization package, based on Matplotlib.

- [Plotly](https://plotly.com/) Plotly is an advanced, industry-standard visualization package that creates powerful interactive visualizations.

- [statsmodels](https://www.statsmodels.org/stable/index.html) statsmodels "... is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration. An extensive list of result statistics are available for each estimator."

- [scikit-learn](https://scikit-learn.org/stable/) is a standard Python Machine Learning library. It encompasses tons of efficient implementations of ML algorithms and is widely used in production and research in Data Science and ML. 

- [IPython kernel](https://ipython.readthedocs.io/en/stable/install/kernel_install.html) - this is here to support work in Jupyter Notebooks. 
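To make the descriptions of NumPy and Pandas above more concrete, here is a small sketch of vectorized arithmetic and of a `DataFrame`. The five values are the first five sepal lengths from the Iris data used later in this notebook; the variable and column names are chosen here for illustration only.

```python
import numpy as np
import pandas as pd

# Sepal lengths (cm) of five flowers.
sepal_length_cm = np.array([5.1, 4.9, 4.7, 4.6, 5.0])

# Vectorized arithmetic: the multiplication applies to every element at once,
# with no explicit Python loop.
sepal_length_mm = sepal_length_cm * 10

# A DataFrame holds named columns of tabular data.
flowers = pd.DataFrame({"length_cm": sepal_length_cm, "length_mm": sepal_length_mm})

print(flowers)
print("Mean length (cm):", flowers["length_cm"].mean())
```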

**Step 5.** Create a `dss03python2023` directory, where things will happen, in `DSS_Vol00_PythonDS_2023`, and enter it

**shell:**

`mkdir dss03python2023`

`cd dss03python2023`

**Step 6.** Clone the [dss03python2023](https://github.com/datakolektiv/dss03python2023) repo from [DataKolektiv](https://github.com/datakolektiv) into `dss03python2023`

**shell:**

`git clone https://github.com/datakolektiv/dss03python2023.git`

And we are ready to go. Now we can do things like...

### 1. Simple Linear Regression with `scikit-learn`

**Note.** If the installations went well, you should be able to execute all of the following by running this Jupyter Notebook.

Import `pandas` and `matplotlib`

In [1]:
# - pckgs
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

Load some data: the Iris dataset, which ships with `scikit-learn`

In [2]:
# - data
from sklearn.datasets import load_iris
ds = load_iris()
data_set = pd.DataFrame(data=ds.data, columns=ds.feature_names)
data_set.head(10)

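| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
| 5 | 5.4 | 3.9 | 1.7 | 0.4 |
| 6 | 4.6 | 3.4 | 1.4 | 0.3 |
| 7 | 5.0 | 3.4 | 1.5 | 0.2 |
| 8 | 4.4 | 2.9 | 1.4 | 0.2 |
| 9 | 4.9 | 3.1 | 1.5 | 0.1 |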


The shape of data

In [3]:
data_set.shape

(150, 4)

Correlations

In [4]:
data_set.corr()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| sepal length (cm) | 1.0 | -0.11757 | 0.871754 | 0.817941 |
| sepal width (cm) | -0.11757 | 1.0 | -0.42844 | -0.366126 |
| petal length (cm) | 0.871754 | -0.42844 | 1.0 | 0.962865 |
| petal width (cm) | 0.817941 | -0.366126 | 0.962865 | 1.0 |
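Each entry of this matrix is a Pearson correlation coefficient, which is what `DataFrame.corr()` computes by default. As a rough illustration of what that number is, the sketch below computes it by hand with NumPy for the ten rows displayed earlier (sepal length and petal length). The helper name `pearson_r` is made up here, and because only ten rows are used, the result will differ from the full-data value in the table.

```python
import numpy as np

# The ten rows shown by data_set.head(10).
sepal_length = np.array([5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9])
petal_length = np.array([1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5])


def pearson_r(x, y):
    """Pearson correlation: co-variation of x and y, scaled by their spreads."""
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    numerator = (x_centered * y_centered).sum()
    denominator = np.sqrt((x_centered ** 2).sum() * (y_centered ** 2).sum())
    return numerator / denominator


r = pearson_r(sepal_length, petal_length)
print(f"Pearson r on the ten rows shown: {r:.4f}")
```

The same value can be cross-checked with `np.corrcoef(sepal_length, petal_length)[0, 1]`.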


Visualize the correlation matrix with Seaborn

In [5]:
# - Seaborn
import seaborn as sns
cor_mat = data_set.corr()
plt.figure(figsize=(16, 6))
sns.heatmap(cor_mat, annot = True)

Check

In [None]:
data_set.dtypes

In [None]:
data_set.shape

Perform Simple Linear Regression with `scikit-learn`: attempt to predict the `petal length (cm)` value from `sepal length (cm)`

In [None]:
# - numpy
import numpy as np
# - scikit-learn
from sklearn.linear_model import LinearRegression
# - predictor: reshape to a single-column 2D array, since we have only one predictor
X = np.array(data_set['sepal length (cm)']).reshape(-1, 1)
# - target: keep as a 1D array, so that coef_ is 1D and intercept_ is a scalar
y = np.array(data_set['petal length (cm)'])
# - fit linear model
linear_model = LinearRegression().fit(X, y)
# - report
print("The slope of the regression line is: " + str(linear_model.coef_[0]))
print("The intercept of the regression line is: " + str(linear_model.intercept_))
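With a single predictor, the least-squares slope and intercept have a simple closed form: the slope is the sum of cross-deviations divided by the sum of squared deviations of the predictor, and the intercept follows from the means. The sketch below applies that formula with NumPy to the ten rows displayed earlier, so its numbers will not match a fit on all 150 rows. The helper name `simple_ols` is introduced here for illustration and is not part of `scikit-learn`.

```python
import numpy as np

# The ten rows shown by data_set.head(10): predictor and target.
x = np.array([5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9])  # sepal length (cm)
y = np.array([1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5])  # petal length (cm)


def simple_ols(x, y):
    """Least-squares slope and intercept for a single predictor."""
    x_mean, y_mean = x.mean(), y.mean()
    slope = ((x - x_mean) * (y - y_mean)).sum() / ((x - x_mean) ** 2).sum()
    # The fitted line always passes through the point of means.
    intercept = y_mean - slope * x_mean
    return slope, intercept


slope, intercept = simple_ols(x, y)
print(f"slope = {slope:.4f}, intercept = {intercept:.4f}")
```

Fitting `LinearRegression` on the same ten rows should give the same two numbers up to floating-point error.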

Visualize

In [None]:
plt.figure(figsize=(8, 8))
plt.box(False)
plt.scatter(X, y, color = 'green', marker = '.')
plt.plot(X, linear_model.predict(X), color = 'blue', linewidth=.5)
plt.title('Iris: sepal length vs petal length')
plt.xlabel('sepal length (cm)')
plt.ylabel('petal length (cm)')
plt.show()

### 2. Intro Readings and Videos

- 

### 3. Highly Recommended To Do

- Watch [Python NumPy Tutorial for Beginners](https://www.youtube.com/watch?v=QUT1VHiLmmI)
- Read chapter [Introduction to NumPy](https://jakevdp.github.io/PythonDataScienceHandbook/02.00-introduction-to-numpy.html) from [Python Data Science Handbook, Jake VanderPlas](https://jakevdp.github.io/PythonDataScienceHandbook/)

<hr>

Goran S. Milovanović

DataKolektiv, 2023

[hello@datakolektiv.com](mailto:goran.milovanovic@datakolektiv.com)

![](../img/DK_Logo_100.png)

License: [GPLv3](https://www.gnu.org/licenses/gpl-3.0.txt) This Notebook is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This Notebook is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this Notebook. If not, see http://www.gnu.org/licenses/.