# Intro to Data Science - Part 1: Numpy and Pandas

In this stage of AceleraDev Data Science we'll focus on two of the most important tools for working with data in Python: `numpy` and `pandas`.

This content challenge comes after two interesting readings, which followed the step-by-step practice of the First Machine Learning project. The first is a text written by José Portilla and reposted on KDnuggets, titled [How to Become a Data Scientist: The Definitive Guide](https://www.kdnuggets.com/2017/08/become-data-scientist-definitive-guide.html). The second is Building Data Science Teams: The Skills, Tools, and Perspectives Behind Great Data Science Groups (Patil, DJ; 2011), an O'Reilly publication that discusses the main concepts and views on data science teams and how to think about them.

Now we were directed to follow the tutorial [Intro to Data Science - Part 1: Numpy and Pandas](https://towardsdatascience.com/intro-to-data-science-part-1-numpy-and-pandas-49d98740661b), written by Tiffany Souterre and published on Towards Data Science. As Souterre explains, this tutorial covers array creation and manipulation with Numpy, as well as series and data frames with Pandas. It aims to teach how to:

- Create, index, slice, manipulate numpy arrays
- Create a matrix with a 2D numpy array
- Apply arithmetic to numpy arrays
- Apply mathematical functions to numpy arrays (mean and dot product)
- Create, index, slice, manipulate pandas series
- Create a pandas data frame
- Select data frame rows through slicing, individual index (iloc or loc), boolean indexing

## `Numpy`: arrays and matrices

Numpy brings support for multidimensional arrays and matrices to Python. Its basic data structure is the ***array***, which is similar to a built-in Python list, with one difference: all elements of an array must have the same type.

As usual, the first step to using a library is to import it into the environment.

In [10]:
# Import numpy
import numpy as np

With its functions available, I can start working with it.

### Creating arrays and matrices

In [182]:
# A one-dimensional array
array = np.array([1, 4, 5, 8], float)
print(array)

[1. 4. 5. 8.]


Above I created a simple one-dimensional array using `np.array()`. The first argument passed to the function was a list (`[1, 4, 5, 8]`) and the second argument was the data type (`float`).
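
Because every element of an array shares one type, numpy converts the input values when needed. The sketch below is only an illustration of that behaviour; the variable names are my own and are not part of the tutorial.

```python
import numpy as np

# Mixed ints and floats are upcast to a common type.
mixed_numbers = np.array([1, 2.5, 3])
print(mixed_numbers.dtype)        # float64

# An explicit dtype converts every element on creation.
as_float = np.array([1, 4, 5, 8], float)
as_int = np.array([1.9, 4.2], int)   # the fractional part is truncated
print(as_float.dtype, as_int)
```

The `dtype` attribute is a quick way to check which type numpy actually chose.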

Creating a 2D array works the same way, except that we pass a list of lists (one inner list per row) instead of a flat list.

In [181]:
# A 2D array/matrix
matrix = np.array([[1, 2, 3, 4], [4, 5, 6, 7], [7, 8, 9, 10]], float)  
print(matrix)

[[ 1.  2.  3.  4.]
 [ 4.  5.  6.  7.]
 [ 7.  8.  9. 10.]]
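
To confirm the layout of a 2D array, its attributes can be inspected. This is a small sketch that rebuilds the same `matrix` so it runs on its own.

```python
import numpy as np

matrix = np.array([[1, 2, 3, 4], [4, 5, 6, 7], [7, 8, 9, 10]], float)

print(matrix.ndim)    # number of dimensions: 2
print(matrix.shape)   # (rows, columns): (3, 4)
print(matrix.size)    # total number of elements: 12
```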


### Indexing, slicing and manipulating arrays

Learning to perform these tasks with Numpy is pretty straightforward, as it uses the same logic as plain Python lists. For example, I can extract a scalar by indexing the second element of the `array` object created above, as simply as this:

In [223]:
array[1]

4.0

Or extract all values up to and including the second element, like this:

In [221]:
array[:2]

array([1., 4.])

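Manipulation is simple too. To change the value of the second element, just assign to it (the later examples assume the original values, so re-run the array creation cell afterwards):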

In [154]:
array[1] = 5.0
array[1]

5.0
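
One difference from Python lists is worth keeping in mind when manipulating slices: a numpy slice is a view on the original data, not a copy. The sketch below illustrates this with variable names of my own choosing.

```python
import numpy as np

array = np.array([1, 4, 5, 8], float)

view = array[:2]           # a slice shares memory with the original
view[0] = 100.0
print(array)               # the first element of `array` changed too

array = np.array([1, 4, 5, 8], float)
independent = array[:2].copy()   # an explicit copy is independent
independent[0] = 100.0
print(array)               # unchanged
```

Calling `.copy()` is the safer choice whenever the original values must be preserved.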

### Indexing, slicing and manipulating matrices

Indexing matrices is an extension of array indexing. The difference is that a matrix combines horizontal and vertical vectors, that is, rows (observations) and columns (variables), so an element is addressed by a row index and a column index.

For example, to extract the value in the second column of the first row:

In [38]:
matrix[0, 1]

2.0

Which performs the same as

In [39]:
matrix[0][1]

2.0

If I want to slice the whole second row:

In [63]:
matrix[1, :]

array([4., 5., 6., 7.])

Or if I want the whole third column:

In [87]:
matrix[:, 2]

array([3., 6., 9.])
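
Row and column slices can also be combined to extract a rectangular block, and negative indices count from the end. A minimal sketch, reusing the same `matrix` definition:

```python
import numpy as np

matrix = np.array([[1, 2, 3, 4], [4, 5, 6, 7], [7, 8, 9, 10]], float)

block = matrix[0:2, 1:3]   # rows 0-1 and columns 1-2
last_row = matrix[-1]      # same as matrix[2, :]
last_col = matrix[:, -1]   # same as matrix[:, 3]

print(block)
print(last_row)
print(last_col)
```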

### Arithmetic operations with `numpy`

Another very important set of features in `numpy` concerns arithmetic and mathematical operations. Plain Python lists don't support element-wise arithmetic or computations like mean, median and standard deviation, and `numpy` fills this gap.

In [237]:
# calculate sum, difference and product
Sum = array + array
Diff = array - array
Product = array * array
results = np.array([Sum, Diff, Product], float)
results

array([[ 2.,  8., 10., 16.],
       [ 0.,  0.,  0.,  0.],
       [ 1., 16., 25., 64.]])

Above I computed some array-against-array operations and then put the results in a matrix for printing. This matrix is useful for showing other examples. First, I can perform arithmetic operations on pieces of the data by slicing it.

In [206]:
print(results[0] + matrix[2])

[ 9. 16. 19. 26.]


In [207]:
print(matrix[0, 1] + results[2, 2])

27.0


And I can perform these operations on entire matrices too.

In [238]:
# calculate sum, difference and product for matrices
Sum = matrix + results
Diff = matrix - results
Product = matrix * results
results = np.array([Sum, Diff, Product], float)
print(results)

[[[  3.  10.  13.  20.]
  [  4.   5.   6.   7.]
  [  8.  24.  34.  74.]]

 [[ -1.  -6.  -7. -12.]
  [  4.   5.   6.   7.]
  [  6.  -8. -16. -54.]]

 [[  2.  16.  30.  64.]
  [  0.   0.   0.   0.]
  [  7. 128. 225. 640.]]]
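
Note that `*` multiplies element by element. The dot product mentioned in the tutorial's list is a different operation, available through `np.dot`. The sketch below uses a second vector `b` that I made up for illustration.

```python
import numpy as np

a = np.array([1, 4, 5, 8], float)
b = np.array([2, 0, 1, 1], float)   # hypothetical vector, not from the tutorial

elementwise = a * b      # [2., 0., 5., 8.]
dot = np.dot(a, b)       # 2 + 0 + 5 + 8 = 15.0

# A (3, 4) matrix times a length-4 vector gives a length-3 vector.
matrix = np.array([[1, 2, 3, 4], [4, 5, 6, 7], [7, 8, 9, 10]], float)
matrix_times_vector = np.dot(matrix, a)

print(elementwise, dot)
print(matrix_times_vector)
```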


For the mean, median and standard deviation, `numpy` provides dedicated functions. Let's see how to use them with arrays and matrices, but first I will print the array to recall its values.

In [210]:
print(array)

[1. 4. 5. 8.]


What are the mean, the median and the standard deviation of `array`?

In [235]:
mean = np.mean(array)
median = np.median(array)
sd = np.std(array)

print('Mean: ', mean, '; Median: ', median, '; Std. Deviation: ', sd, sep = '')

Mean: 4.5; Median: 4.5; Std. Deviation: 2.5
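
The value 2.5 is the population standard deviation, because `np.std` divides by $n$ by default. The sample standard deviation, which divides by $n - 1$, can be requested with the `ddof` parameter, as this short sketch shows.

```python
import numpy as np

array = np.array([1, 4, 5, 8], float)

population_sd = np.std(array)        # divides by n
sample_sd = np.std(array, ddof=1)    # divides by n - 1

print(population_sd)   # 2.5
print(sample_sd)       # larger than the population value
```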


For matrices, we can compute over the whole dataset:

In [274]:
matrix.mean()

5.5

Or we can compute along a chosen axis. Since `results` is now a stack of three matrices, the axis argument decides which dimension is averaged away.

In [275]:
results.mean(0) # for the same [i, j] position across the three matrices

array([[  1.33333333,   6.66666667,  12.        ,  24.        ],
       [  2.66666667,   3.33333333,   4.        ,   4.66666667],
       [  7.        ,  48.        ,  81.        , 220.        ]])

In [276]:
results.mean(1) # for each column within each matrix

array([[  5.        ,  13.        ,  17.66666667,  33.66666667],
       [  3.        ,  -3.        ,  -5.66666667, -19.66666667],
       [  3.        ,  48.        ,  85.        , 234.66666667]])

In [277]:
results.mean(2) # for each row within each matrix

array([[ 11.5,   5.5,  35. ],
       [ -6.5,   5.5, -18. ],
       [ 28. ,   0. , 250. ]])
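
The axis argument is easier to follow on a plain 2D matrix. A minimal sketch with the original `matrix`: axis 0 collapses the rows, giving one value per column, and axis 1 collapses the columns, giving one value per row.

```python
import numpy as np

matrix = np.array([[1, 2, 3, 4], [4, 5, 6, 7], [7, 8, 9, 10]], float)

column_means = matrix.mean(axis=0)   # one mean per column (4 values)
row_means = matrix.mean(axis=1)      # one mean per row (3 values)

print(column_means)
print(row_means)
```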

Obviously this is just a glimpse of the possibilities offered by `numpy`, and much more can be done. For a deeper dive into `numpy`, its [reference](https://docs.scipy.org/doc/numpy/reference/) is a good starting point.

## `Pandas`: series and dataframes

With `pandas`, Python can work with dataframes, a data structure very well suited to data analysis. It's basically the same as working with dataframes in R, but in Python.

Let's import `pandas`.

In [291]:
import pandas as pd

### Series

The basic structure of pandas is the **Series**, which holds one-dimensional data, like a Python list, a `numpy` array or a vector in R.

In [292]:
series = pd.Series(['Dave', 'Cheng-Han', 'Udacity', 42, -1789710578])
series

0           Dave
1      Cheng-Han
2        Udacity
3             42
4    -1789710578
dtype: object

Unlike the arrays above, a Series can hold values of different data types. Moreover, each element in a Series is indexed by default with an integer from $0$ to $n - 1$, where $n$ is the number of elements. But we can specify different indexes.
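
The `dtype` shown at the bottom of the output tells how pandas stored the values. As a rough sketch (the two small Series here are my own examples), a Series of integers gets an integer dtype, while mixed values fall back to the generic `object` dtype.

```python
import pandas as pd

numbers = pd.Series([1, 2, 3])
mixed = pd.Series(['Dave', 42])

print(numbers.dtype)          # an integer dtype
print(mixed.dtype)            # object
print(list(numbers.index))    # default integer index: [0, 1, 2]
```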

In [295]:
series = pd.Series(
    ['Dave', 'Cheng-Han', 359, 9001],
    index=['Instructor', 'Curriculum Manager', 'Course Number', 'Power Level']
)

print(series)

Instructor                 Dave
Curriculum Manager    Cheng-Han
Course Number               359
Power Level                9001
dtype: object


And we can use these custom indexes to select data in a smooth and intuitive way.

In [301]:
series['Instructor']

'Dave'

In [300]:
series[['Instructor', 'Curriculum Manager', 'Course Number']]

Instructor                 Dave
Curriculum Manager    Cheng-Han
Course Number               359
dtype: object
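
Selection can also be made explicit with `.loc` (by label) and `.iloc` (by position). A small sketch using the same Series; note that a label slice includes its end point, while a position slice does not.

```python
import pandas as pd

series = pd.Series(
    ['Dave', 'Cheng-Han', 359, 9001],
    index=['Instructor', 'Curriculum Manager', 'Course Number', 'Power Level']
)

by_label = series.loc['Course Number']      # 359
by_position = series.iloc[2]                # 359, third element

label_slice = series.loc['Instructor':'Course Number']   # end label included
position_slice = series.iloc[0:2]                        # end position excluded

print(by_label, by_position)
print(len(label_slice), len(position_slice))
```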


Another way to select items of a Series is to use boolean conditions.

In [316]:
cuteness = pd.Series([1, 2, 3, 4, 5],
                     index=['Cockroach', 'Fish', 'Mini Pig', 'Puppy', 'Kitten'])
print(cuteness)

Cockroach    1
Fish         2
Mini Pig     3
Puppy        4
Kitten       5
dtype: int64


In [317]:
cuteness > 3

Cockroach    False
Fish         False
Mini Pig     False
Puppy         True
Kitten        True
dtype: bool

In [318]:
cuteness[cuteness > 3]

Puppy     4
Kitten    5
dtype: int64
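
Conditions can be combined with `&` (and), `|` (or) and `~` (not), as long as each condition is wrapped in parentheses. A minimal sketch on the same `cuteness` Series; the result names are my own.

```python
import pandas as pd

cuteness = pd.Series([1, 2, 3, 4, 5],
                     index=['Cockroach', 'Fish', 'Mini Pig', 'Puppy', 'Kitten'])

middle = cuteness[(cuteness > 1) & (cuteness < 5)]      # both conditions hold
extremes = cuteness[(cuteness == 1) | (cuteness == 5)]  # either condition holds
not_cute = cuteness[~(cuteness > 3)]                    # condition negated

print(middle)
print(extremes)
print(not_cute)
```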

### Dataframes

Dataframes are the two-dimensional data structure of `pandas`. A dataframe is like a matrix, but it can store different types of data in each column and, in addition, comes with a lot of useful built-in tools.

In [307]:
data = {'year': [2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012],
        'team': ['Bears', 'Bears', 'Bears', 'Packers', 'Packers', 'Lions',
                 'Lions', 'Lions'],
        'wins': [11, 8, 10, 15, 11, 6, 10, 4],
        'losses': [5, 8, 6, 1, 5, 10, 6, 12]}
football = pd.DataFrame(data)
football

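| | year | team | wins | losses |
|---|---|---|---|---|
| 0 | 2010 | Bears | 11 | 5 |
| 1 | 2011 | Bears | 8 | 8 |
| 2 | 2012 | Bears | 10 | 6 |
| 3 | 2011 | Packers | 15 | 1 |
| 4 | 2012 | Packers | 11 | 5 |
| 5 | 2010 | Lions | 6 | 10 |
| 6 | 2011 | Lions | 10 | 6 |
| 7 | 2012 | Lions | 4 | 12 |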


Basically, a pandas DataFrame is a set of crossed Series: each row and each column is a Series. Columns can be selected by name.
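
A quick way to see this is to check the type of a single column and of a single row. The sketch below rebuilds the `football` DataFrame so it runs on its own.

```python
import pandas as pd

data = {'year': [2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012],
        'team': ['Bears', 'Bears', 'Bears', 'Packers', 'Packers', 'Lions',
                 'Lions', 'Lions'],
        'wins': [11, 8, 10, 15, 11, 6, 10, 4],
        'losses': [5, 8, 6, 1, 5, 10, 6, 12]}
football = pd.DataFrame(data)

column = football['wins']   # a Series indexed by the row labels
row = football.iloc[0]      # a Series indexed by the column names

print(type(column).__name__, type(row).__name__)
print(row['team'], row['wins'])
```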

In [373]:
football['year']

0    2010
1    2011
2    2012
3    2011
4    2012
5    2010
6    2011
7    2012
Name: year, dtype: int64

In [357]:
# shorthand for football['year']
football.year

0    2010
1    2011
2    2012
3    2011
4    2012
5    2010
6    2011
7    2012
Name: year, dtype: int64

In [358]:
football[['year', 'wins', 'losses']]

| | year | wins | losses |
|---|---|---|---|
| 0 | 2010 | 11 | 5 |
| 1 | 2011 | 8 | 8 |
| 2 | 2012 | 10 | 6 |
| 3 | 2011 | 15 | 1 |
| 4 | 2012 | 11 | 5 |
| 5 | 2010 | 6 | 10 |
| 6 | 2011 | 10 | 6 |
| 7 | 2012 | 4 | 12 |


Rows can be selected too, in many possible ways.

In [382]:
football.iloc[[0]]

| | year | team | wins | losses |
|---|---|---|---|---|
| 0 | 2010 | Bears | 11 | 5 |


In [383]:
football.loc[[0]]

| | year | team | wins | losses |
|---|---|---|---|---|
| 0 | 2010 | Bears | 11 | 5 |


In [384]:
football[3:5]

| | year | team | wins | losses |
|---|---|---|---|---|
| 3 | 2011 | Packers | 15 | 1 |
| 4 | 2012 | Packers | 11 | 5 |


In [385]:
football[football.wins > 10]

| | year | team | wins | losses |
|---|---|---|---|---|
| 0 | 2010 | Bears | 11 | 5 |
| 3 | 2011 | Packers | 15 | 1 |
| 4 | 2012 | Packers | 11 | 5 |


In [386]:
football[(football.wins > 10) & (football.team == "Packers")]

| | year | team | wins | losses |
|---|---|---|---|---|
| 3 | 2011 | Packers | 15 | 1 |
| 4 | 2012 | Packers | 11 | 5 |
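
Two details of the selections above may be worth a sketch. First, `.loc` accepts a boolean condition for the rows together with a list of columns. Second, `iloc[0]` returns the row as a Series, while `iloc[[0]]` (with a list) keeps it as a one-row DataFrame. The variable names below are my own.

```python
import pandas as pd

data = {'year': [2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012],
        'team': ['Bears', 'Bears', 'Bears', 'Packers', 'Packers', 'Lions',
                 'Lions', 'Lions'],
        'wins': [11, 8, 10, 15, 11, 6, 10, 4],
        'losses': [5, 8, 6, 1, 5, 10, 6, 12]}
football = pd.DataFrame(data)

# Filter rows and pick columns in a single step.
strong = football.loc[football['wins'] > 10, ['team', 'wins']]

as_series = football.iloc[0]     # single position -> Series
as_frame = football.iloc[[0]]    # list of positions -> DataFrame

print(strong)
print(type(as_series).__name__, type(as_frame).__name__)
```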


### Analysing `pandas` DataFrames

In [308]:
football.dtypes

year       int64
team      object
wins       int64
losses     int64
dtype: object

In [309]:
football.describe()

| | year | wins | losses |
|---|---|---|---|
| count | 8.0 | 8.0 | 8.0 |
| mean | 2011.125 | 9.375 | 6.625 |
| std | 0.834523 | 3.377975 | 3.377975 |
| min | 2010.0 | 4.0 | 1.0 |
| 25% | 2010.75 | 7.5 | 5.0 |
| 50% | 2011.0 | 10.0 | 6.0 |
| 75% | 2012.0 | 11.0 | 8.5 |
| max | 2012.0 | 15.0 | 12.0 |
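
The figures reported by `describe()` can also be computed one column at a time, which helps to check what each row of the summary means. A minimal sketch for the `wins` column; note that the pandas `std` method uses the sample formula (dividing by $n - 1$) by default.

```python
import pandas as pd

data = {'year': [2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012],
        'team': ['Bears', 'Bears', 'Bears', 'Packers', 'Packers', 'Lions',
                 'Lions', 'Lions'],
        'wins': [11, 8, 10, 15, 11, 6, 10, 4],
        'losses': [5, 8, 6, 1, 5, 10, 6, 12]}
football = pd.DataFrame(data)

wins = football['wins']
print(wins.count())    # number of non-missing values
print(wins.mean())     # 9.375
print(wins.median())   # the 50% row of describe()
print(wins.std())      # sample standard deviation
print(wins.min(), wins.max())
```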


In [314]:
football.head()

| | year | team | wins | losses |
|---|---|---|---|---|
| 0 | 2010 | Bears | 11 | 5 |
| 1 | 2011 | Bears | 8 | 8 |
| 2 | 2012 | Bears | 10 | 6 |
| 3 | 2011 | Packers | 15 | 1 |
| 4 | 2012 | Packers | 11 | 5 |


In [315]:
football.tail()

| | year | team | wins | losses |
|---|---|---|---|---|
| 3 | 2011 | Packers | 15 | 1 |
| 4 | 2012 | Packers | 11 | 5 |
| 5 | 2010 | Lions | 6 | 10 |
| 6 | 2011 | Lions | 10 | 6 |
| 7 | 2012 | Lions | 4 | 12 |
