# Intro to Data Science with Python



### Conda 

https://conda.io/docs/

Conda was created as a package manager for data scientists to deal with the non-Python dependencies of Python packages. Numpy is a good example: it works because of C bindings that let you create Numpy arrays, which are central to data science in Python, and installing it used to require a C compiler. Conda gave data scientists, who often had less experience with configuration than developers, a single package manager that bundled the libraries they use. Conda comes in two distributions, Miniconda and Anaconda. Anaconda is a bundle of over 150 scientific libraries that takes about 3 GB of disk space. When you are starting out, Conda can be a great way to begin without the hassle of thinking about environments and versioning. Many beginners pollute the global scope, which can lead to painful lessons, and that is okay.

Because of pip wheels and the maturation of Python tooling, Conda is no longer as necessary, though it can still be useful. I use virtualenvwrapper, which lets me create virtualenvs and then pip install whatever dependencies I need into them. For data science you can keep one virtualenv with all of the tools you usually use and just work in that virtualenv.
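
If you are not sure whether you are about to install into the global scope, a quick check from inside Python can help. This is a minimal sketch that assumes a `venv` or `virtualenv` style environment; the helper name `in_virtualenv` is made up for illustration, and a Conda environment is a separate mechanism that this check does not detect.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv/virtualenv."""
    # Inside a virtual environment sys.prefix points at the environment,
    # while sys.base_prefix still points at the base installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print("interpreter prefix:", sys.prefix)
print("inside a virtualenv:", in_virtualenv())
```

If this prints `False`, anything you `pip install` may land in the global site-packages rather than in an isolated environment.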

### Libraries we will love and know

**Base Tools**
1. Jupyter Notebook (IDE of choice)
2. Numpy (matrices and linear algebra operations)
3. Pandas (Excel in Python, a library for manipulating data en masse)

**Data Visualization**
1. Matplotlib (basically the Numpy of data vis, other tools are built on top of this)
2. Seaborn (make your matplotlibs beautiful)
3. Bokeh (works in the style of d3.js)

**Data Operations**
1. Scikit-learn (machine learning)
2. Scipy (scientific and engineering computing)

### Jupyter Notebook

Google *jupyter notebook online* and click this link: https://tmp60.tmpnb.org

Go to this URL to mess with a Jupyter notebook without having to download anything. Even if you don't go any further with data science, Jupyter notebooks can be a great way to take notes, import images and video, and evaluate code. In data science it's a fantastic way of interfacing with data and sharing code with people.

Jupyter can run Python, R, and Julia, and it can render LaTeX. There are also kernels for Jupyter that run Haskell, which is amazing. For those who use Vim, Jupyter has a similar modal interface, with a command mode and an edit mode.

Go to Jupyter and mess around a little

**Great Jupyter shortcuts**

**Command mode**
1. *m*: change cell to markdown 
2. *a* and *b*: insert a cell above (*a*) or below (*b*) the current cell
3. *h*: launches help, which shows all the keyboard shortcuts

**Edit mode**
1. *Shift + tab*: Tooltip with signature information about a method


### Numpy 

Numpy, and specifically the Numpy array, is the engine that makes data science in Python run. A Numpy array is typed, homogeneous, and densely packed, so it takes up less memory than a Python list of the same values. Numpy operations are implemented in C and avoid the per-element cost of loops in Python. Most Numpy operations are vectorized: they apply one operation to a whole array at once and can use SIMD (single instruction, multiple data) instructions on the CPU for large performance gains. Vectorization and the resulting data parallelism are the same idea behind GPU, and now TPU, performance. That hardware is designed for millions of parallel operations over homogeneous data types.
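
To see what vectorization buys you, here is a rough sketch that squares the same array twice, once with a Python loop and once with a single Numpy expression. The array size and variable names are arbitrary choices for illustration, and the timings will differ from machine to machine, so treat the printed numbers as indicative only.

```python
import time

import numpy as np

values = np.arange(100_000, dtype=np.float64)

# Loop version: the interpreter handles one element per iteration.
start = time.perf_counter()
squared_loop = np.empty_like(values)
for i in range(values.size):
    squared_loop[i] = values[i] * values[i]
loop_seconds = time.perf_counter() - start

# Vectorized version: one expression, the loop runs inside Numpy's C code.
start = time.perf_counter()
squared_vec = values * values
vec_seconds = time.perf_counter() - start

# Both approaches compute the same result.
assert np.array_equal(squared_loop, squared_vec)

print(f"loop:       {loop_seconds:.6f} s")
print(f"vectorized: {vec_seconds:.6f} s")
# Every element has the same type and size, packed side by side in memory.
print(values.dtype, values.itemsize, "bytes per element")
```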

Numpy isn't meant for referencing a single element of an array, so if your background is programming for the web you have to begin thinking in this parallel style. Pandas will help you a lot with this. In Numpy there is a notion called *broadcasting*, where we multiply a scalar or a vector (of matching length to the matrix) over a larger n-dimensional matrix. Both the paradigm and the implementation of many data science tools are linear algebra, and being able to visualize and think about them in those terms is helpful. Functional programming also helps: some data science languages are functional, like Mathematica, or are inspired by functional paradigms, like R.

![2017-11-20-16-53-www.mathsisfun.com.png](attachment:2017-11-20-16-53-www.mathsisfun.com.png)

```python
import numpy as np

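a = np.ones([4, 4])        # 4x4 matrix filled with ones
np.shape(a)                # (4, 4)
b = np.array([1, 2, 3, 4])
np.shape(b)                # (4,)
c = a * b                  # broadcasting: b is multiplied across every row of a
d = np.multiply(a, b)      # same result as a * b (arrays have no .multiply method)
e = 17 * c                 # a scalar broadcasts over every element
f = c.reshape([8, 2])      # the same 16 values laid out as 8 rows by 2 columns
g = np.transpose(f)        # swap the axes: shape (2, 8)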
```
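
The shape of the vector decides which direction it is broadcast in. The sketch below uses small made-up arrays to show a row vector, a column vector, and a length mismatch that Numpy refuses to broadcast; the variable names are only for illustration.

```python
import numpy as np

matrix = np.ones((4, 4))
row = np.array([1, 2, 3, 4])   # shape (4,): lines up with the columns
column = row.reshape(4, 1)     # shape (4, 1): lines up with the rows

by_row = matrix * row          # every row becomes [1, 2, 3, 4]
by_column = matrix * column    # every column becomes [1, 2, 3, 4]
scaled = 17 * matrix           # a scalar stretches over the whole matrix

print(by_row)
print(by_column)

# A vector whose length does not match cannot be broadcast.
try:
    matrix * np.array([1, 2, 3])
except ValueError as exc:
    print("cannot broadcast:", exc)
```

Multiplying by the row vector and by the column vector gives results that are transposes of each other, which is a handy way to check which axis you actually scaled.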

**Essential Numpy Functions**
1. np.arange
2. np.dot
3. np.reshape
4. np.transpose
5. np.multiply
6. np.divide
7. np.matrix
8. np.ones # matrix filled with ones (use np.eye for the identity matrix)
9. np.zeros
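
Here is a short sketch that runs most of the functions listed above on a tiny made-up array, so you can see what each one returns. `np.eye` is included only to contrast an identity matrix with `np.ones`.

```python
import numpy as np

seq = np.arange(6)                    # [0 1 2 3 4 5]
grid = np.reshape(seq, (2, 3))        # [[0 1 2], [3 4 5]]
flipped = np.transpose(grid)          # shape (3, 2)

product = np.dot(grid, flipped)       # matrix product, shape (2, 2)
elementwise = np.multiply(grid, grid) # multiplies matching elements
halves = np.divide(grid, 2)           # true division, result is float

ones = np.ones((2, 2))                # all ones, NOT the identity matrix
zeros = np.zeros((2, 2))              # all zeros
identity = np.eye(2)                  # ones on the diagonal only

print(product)
print(np.dot(product, identity))      # multiplying by the identity changes nothing
print(np.dot(product, ones))          # multiplying by all ones does change it
```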


### Pandas
Pandas is built around one unit, the DataFrame. A DataFrame is a collection of columns, each of which is a Series backed by a Numpy array. You can imagine it like an Excel sheet, because it is very much an Excel sheet.
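
You can peel a DataFrame apart to see the Numpy underneath. This is a minimal sketch with made-up column names and values, not data from any real source.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"col1": [1, 2], "col2": [3.5, 4.5]})

column = frame["col1"]      # selecting one column gives a pandas Series
raw = column.to_numpy()     # and the Series hands back a Numpy array

print(type(column).__name__)
print(type(raw).__name__)
print(frame.dtypes)         # each column carries its own dtype
```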

Here is a YouTuber who does really nice videos about the essential operations of Pandas and other Python data science libraries.
https://www.youtube.com/channel/UCh9nVJoWXmFb7sLApWGcLPQ

```python 
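import pandas as pd
from sklearn import datasets

# iris.data and iris.target are plain numpy arrays
iris = datasets.load_iris()

# a Series wraps the same kind of numpy array we were
# just using; .values exposes that array
a = pd.Series([5, 6, 2, 9])
a.values.reshape([2, 2])

# read_csv loads a CSV file (local path or URL) into a DataFrame
link = 'http://www.randalolson.com/wp-content/uploads/percent-bachelors-degrees-women-usa.csv'
data = pd.read_csv(link)

# build a DataFrame from a dict mapping column names to values
b = {'col1': [1, 2], 'col2': [3, 4]}
c = pd.DataFrame(data=b)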
```

**Essential Pandas Functions on DataFrame**
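1. .head # use to look at the first rows of the data
2. .tail # use to look at the last rows
3. pd.read_csv # a top-level pandas function that returns a DataFrame
4. .iloc # these two indexers are used for slicing DataFrames: .iloc selects by integer position
5. .loc # and .loc selects by label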
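
Here is a small sketch of the operations listed above. The CSV text is invented for the example and is read from an in-memory string so that nothing has to be downloaded; with a real file you would pass a path or URL to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Toy CSV text, made up for this sketch, standing in for a real file.
csv_text = """label,year,value
a,1970,10
b,1971,20
c,1972,30
d,1973,40
"""
frame = pd.read_csv(io.StringIO(csv_text), index_col="label")

first_two = frame.head(2)                  # first two rows
last_one = frame.tail(1)                   # last row

# .iloc uses integer positions and excludes the end of the slice.
by_position = frame.iloc[0:2, 0]           # rows 0 and 1 of the first column

# .loc uses labels and includes both ends of the slice.
by_label = frame.loc["a":"b", "value"]     # rows "a" and "b" of "value"

# .loc also accepts a boolean mask to filter rows.
filtered = frame.loc[frame["value"] > 20]

print(first_two)
print(by_position)
print(by_label)
print(filtered)
```

The difference in how the two indexers treat the end of a slice is a common source of off-by-one surprises, so it is worth trying both on a frame this small.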

### Seaborn 
This article does a great job of showing how beautiful your charts can get with some effort. 

https://www.dataquest.io/blog/making-538-plots/
![2017-11-20-15-16-www.dataquest.io.png](attachment:2017-11-20-15-16-www.dataquest.io.png)
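
Since Seaborn sits on top of Matplotlib, a lot of the polish comes from Matplotlib styling. As a rough sketch, the snippet below applies Matplotlib's built-in `fivethirtyeight` style sheet to a synthetic sine curve. The data is a placeholder, not taken from the article, and this is not the article's own code.

```python
import io

import matplotlib

matplotlib.use("Agg")  # headless backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 50)
y = np.sin(x)  # synthetic placeholder data

buffer = io.BytesIO()
# style.context applies the style sheet only inside the with block.
with plt.style.context("fivethirtyeight"):
    fig, ax = plt.subplots(figsize=(6, 3))
    ax.plot(x, y, label="sin(x)")
    ax.set_title("Synthetic example")
    ax.legend()
    fig.savefig(buffer, format="png")  # write the image into memory
plt.close(fig)

print(len(buffer.getvalue()), "bytes of PNG written")
```

Try swapping the style name for another entry in `plt.style.available` to compare looks before investing in hand-tuned details.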

### Data for practice

https://archive.ics.uci.edu/ml/datasets.html
https://www.kaggle.com/datasets

Look at the Kernels. They are often Jupyter notebooks that people upload to walk through some aspect of data manipulation or to introduce different data sets.