# Python Libraries for Data Science - `pandas`

**Purpose:** The purpose of this workbook is to help you get comfortable with the topics outlined below.

**Prereqs**
* Python Fundamentals Workbook or a good grasp of basic Python
* Numpy Workbook or a good grasp of creating and manipulating numpy arrays
    
**Recommended Usage**
* Run each of the cells (Shift+Enter) and edit them as necessary to solidify your understanding
* Do any of the exercises that are relevant to helping you understand the material

**Topics Covered**
* Pandas - vocab and data structures, creation and manipulation of data

# Workbook Setup

In [1]:
# Reload all modules before executing a new line
%load_ext autoreload
%autoreload 2

# Abide by PEP8 code style
%load_ext pycodestyle_magic
%pycodestyle_on

In [2]:
import pandas as pd

import numpy as np  # just to show pandas compatibility with np
import seaborn as sns  # just for getting some sample datasets

# [`pandas`](https://pandas.pydata.org/pandas-docs/stable/)

`pandas` is a library that comes with many easy-to-use data structures and data analysis tools.

[Pandas Cheatsheet (pdf)](https://assets.datacamp.com/blog_assets/PandasPythonForDataScience.pdf)

[Pandas Docs](https://pandas.pydata.org/pandas-docs/stable/)

## Pandas Vocabulary

Like most libraries, Pandas has some custom vocabulary that we need to understand in order to use it properly. 

* **Rows** in Pandas may be referred to as a **row** or an **index**

Many functions accept an `axis` argument, usually defaulting to `axis=0`. It tells Pandas whether to perform the function along the rows or along the columns, the same way numpy uses `axis`.

* **Axis 0** will act on all the ROWS in each COLUMN (axis=0 is the same as saying axis='index')
* **Axis 1** will act on all the COLUMNS in each ROW (axis=1 is the same as saying axis='columns')
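
To make the two axis bullets concrete, here is a minimal sketch using a small made-up DataFrame (the name `demo` and its values are only for illustration).

```python
import pandas as pd

# Hypothetical frame used only for illustration
demo = pd.DataFrame({'x': [1, 2, 3], 'y': [10, 20, 30]})

# axis=0 ('index'): collapse the rows, giving one result per column
col_totals = demo.sum(axis=0)

# axis=1 ('columns'): collapse the columns, giving one result per row
row_totals = demo.sum(axis=1)

print(col_totals)
print(row_totals)
```

`col_totals` is labelled by column name, while `row_totals` is labelled by the row index.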

## Create Pandas Data Structures

Pandas provides two main data structure classes, Series and DataFrame. A DataFrame is essentially a collection of Series that share the same index.

```python
pd.Series()
pd.DataFrame()
```

### Create a Series

A 1D indexed array that can hold any data type; use when you have 1D data

In [6]:
# Series with integer data
s1 = pd.Series([1, 2, 3, 4])
s1

0    1
1    2
2    3
3    4
dtype: int64

In [7]:
a = np.array(['a', 'b', 'c'])

s2 = pd.Series(a)
s2

0    a
1    b
2    c
dtype: object

We can also customize the indices if we want

In [8]:
# Series with custom indices
s4 = pd.Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])
s4

a    0
b    1
c    2
d    3
dtype: int64

In [9]:
# Series from a dictionary (named my_dict to avoid shadowing the built-in dict)
my_dict = {'a': 0,
           'b': [1, 2, 3],
           'c': 2}

s5 = pd.Series(my_dict)
s5

a            0
b    [1, 2, 3]
c            2
dtype: object

### Create a Dataframe

A 2D labeled data structure with columns of potentially different types; a collection of Series data structures

In [10]:
# DataFrame from Python list
df1 = pd.DataFrame([0, 10, 20, 30, 40])
df1

Unnamed: 0,0
0,0
1,10
2,20
3,30
4,40


In [11]:
# Dataframe with different types and lengths
df2 = pd.DataFrame([3, ['a', 'b', 'c']])
df2

Unnamed: 0,0
0,3
1,"[a, b, c]"


In [12]:
# Explicitly name the columns
data = [['a', 12], ['b', 20], ['c', 40], ['d', 33], ['e', 88]]

df3 = pd.DataFrame(data, columns=['letters', 'numbers'], dtype=float)
df3

Unnamed: 0,letters,numbers
0,a,12.0
1,b,20.0
2,c,40.0
3,d,33.0
4,e,88.0


As discussed earlier, each column in a DataFrame is just a Series. We can check this by looking at the types.

In [13]:
series_letters = df3.letters
series_number = df3.numbers

print(type(df3))
print(type(series_letters))
print(type(series_number))

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>


In [14]:
# Create a df using the dictionary
data = {'Country': ['Haiti', 'Canada', 'England'],
        'Capital': ['Port-au-Prince', 'Montreal', 'London'],
        'Population': [100000, 200000, 300000]}

df4 = pd.DataFrame(data, columns=['Country', 'Capital', 'Population'])
df4

Unnamed: 0,Country,Capital,Population
0,Haiti,Port-au-Prince,100000
1,Canada,Montreal,200000
2,England,London,300000


In [15]:
# Specify the index and column labels explicitly
df5 = pd.DataFrame([[1, 2, 3],
                    [4, 5, 6],
                    [7, 8, 9]],
                   index=[1, 2, 3],
                   columns=['a', 'b', 'c'])
df5

Unnamed: 0,a,b,c
1,1,2,3
2,4,5,6
3,7,8,9


In [16]:
df6 = pd.DataFrame(np.arange(12).reshape(3, 4),
                   columns=['A', 'B', 'C', 'D'])
df6

Unnamed: 0,A,B,C,D
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11


## Inspecting DataFrames and Series

```python
df.head()
df.tail()
df.sample()

df.shape
df.info()
df.columns
df.index
df.count()
```

We are going to take a look at some sample datasets from Seaborn. You don't need to know anything about Seaborn at this point; it just ships with some useful datasets that let us see real data in pandas data structures.

In [17]:
diamonds_df = sns.load_dataset('diamonds')

In [18]:
# Check out the beginning of the data
diamonds_df.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63
4,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75


In [19]:
# Check out the end of the data
diamonds_df.tail()

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
53935,0.72,Ideal,D,SI1,60.8,57.0,2757,5.75,5.76,3.5
53936,0.72,Good,D,SI1,63.1,55.0,2757,5.69,5.75,3.61
53937,0.7,Very Good,D,SI1,62.8,60.0,2757,5.66,5.68,3.56
53938,0.86,Premium,H,SI2,61.0,58.0,2757,6.15,6.12,3.74
53939,0.75,Ideal,D,SI2,62.2,55.0,2757,5.83,5.87,3.64


You can pass an integer to both `head` and `tail` (e.g. `df.head(3)`) to set how many rows are returned. The default is 5.
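
A quick sketch of the row-count argument, assuming a ten-row made-up frame called `demo`:

```python
import pandas as pd

# Hypothetical ten-row frame used only for illustration
demo = pd.DataFrame({'n': range(10)})

first_three = demo.head(3)   # rows 0, 1, 2
last_two = demo.tail(2)      # rows 8, 9
default_rows = demo.head()   # no argument: 5 rows

print(first_three)
print(last_two)
```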

In [16]:
# Check out a random sample of 3 rows from the dataframe
diamonds_df.sample(3)

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
47991,0.32,Premium,J,IF,61.2,59.0,533,4.41,4.44,2.71
36222,0.36,Ideal,D,VS2,60.0,56.0,933,4.68,4.66,2.8
38332,0.32,Ideal,E,VVS1,61.3,57.0,1020,4.43,4.38,2.7


In [20]:
# Return the shape as (number of rows, number of columns)
diamonds_df.shape

(53940, 10)

In [21]:
# Return index range
diamonds_df.index

RangeIndex(start=0, stop=53940, step=1)

In [22]:
# Describe df cols
diamonds_df.columns

Index(['carat', 'cut', 'color', 'clarity', 'depth', 'table', 'price', 'x', 'y',
       'z'],
      dtype='object')

In [23]:
# Return count of non-NA values
diamonds_df.count()

carat      53940
cut        53940
color      53940
clarity    53940
depth      53940
table      53940
price      53940
x          53940
y          53940
z          53940
dtype: int64

## I/O

```python
pd.read_csv()
df.to_csv()

pd.read_excel()
df.to_excel()
```

### Read and Write to CSV

In [24]:
df1

Unnamed: 0,0
0,0
1,10
2,20
3,30
4,40


In [26]:
df1.to_csv('myDataFrame.csv')

In [27]:
pd.read_csv('myDataFrame.csv', header=None, nrows=5)

Unnamed: 0,0,1
0,,0
1,0.0,0
2,1.0,10
3,2.0,20
4,3.0,30
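
The extra column and the shifted rows in the output above come from two things: `to_csv` writes the index as the first column, and `header=None` makes `read_csv` treat the header line as data. A minimal sketch of a cleaner round trip follows; it writes to an in-memory buffer rather than a file, which is an assumption made here to keep the example self-contained.

```python
import io

import pandas as pd

demo = pd.DataFrame([0, 10, 20, 30, 40])

# Write to an in-memory text buffer instead of a file on disk
buffer = io.StringIO()
demo.to_csv(buffer)  # the index is written as the first, unnamed column

# Rewind, then tell read_csv that the first column holds the index
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

print(restored)
```

Note that the column label comes back as the string `'0'` rather than the integer `0`, because CSV headers are read as text.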


## Selection and Assignment

```python
s[1]
df[4:]

df.iloc[]
df.loc[]

df.iat[]
df.at[]
```

We can index DataFrames and Series using bracket notation just like Python lists.

In [29]:
# Select one element from series
s2[1]

'b'

In [30]:
s2[1] = 'v'
s2

0    a
1    v
2    c
dtype: object

We can also select slices, as well as subsets based on a condition, much like we do in numpy.

In [31]:
# Select from row 2 to end
df1[2:]

Unnamed: 0,0
2,20
3,30
4,40


In [32]:
s1

0    1
1    2
2    3
3    4
dtype: int64

In [34]:
# Select where value is not >3
s1[~(s1 > 3)]

0    1
1    2
2    3
dtype: int64

In [35]:
# Select where values <2 or >3
s1[(s1 < 2) | (s1 > 3)]

0    1
3    4
dtype: int64

In [36]:
# Select where df 'numbers' col are > 50
df3[df3['numbers'] > 50]

Unnamed: 0,letters,numbers
4,e,88.0


In [37]:
# Select as above but assign new val
df3[df3['numbers'] > 50] = ['s', 100]
df3

Unnamed: 0,letters,numbers
0,a,12.0
1,b,20.0
2,c,40.0
3,d,33.0
4,s,100.0


In [38]:
df3

Unnamed: 0,letters,numbers
0,a,12.0
1,b,20.0
2,c,40.0
3,d,33.0
4,s,100.0


Unlike numpy, however, pandas data structures offer two explicit kinds of lookup: `iloc` selects by integer position, and `loc` selects by label.

In [39]:
# Select via INDEX - row 0, col 1
df3.iloc[[0], [1]]

Unnamed: 0,numbers
0,12.0


In [40]:
# Select via INDEX - row begin to 2, col 1 and assign the vals 0
df3.iloc[:3, 1] = 0
df3

Unnamed: 0,letters,numbers
0,a,0.0
1,b,0.0
2,c,0.0
3,d,33.0
4,s,100.0


In [41]:
# Select via INDEX - row 1, all cols
df3.iloc[[1], :]

Unnamed: 0,letters,numbers
1,b,0.0


In [42]:
# WILL NOT WORK - use loc instead
# df3.iloc[[0], ['numbers']]

The cell above will **NOT WORK** if you run it because 'numbers' is NOT an integer position, it's a label. To select by label we need to use `loc`, not `iloc`.

In [43]:
# Select via LABEL - row 3, col 'numbers'
df3.loc[[3], ['numbers']]

Unnamed: 0,numbers
3,33.0


In [45]:
# Select rows at positions 1 and 2 (the slice 1:3 excludes 3), col 'letters'
df3.loc[df3.index[1:3], 'letters'] = 'a'
df3

Unnamed: 0,letters,numbers
0,a,0.0
1,a,0.0
2,a,0.0
3,d,33.0
4,s,100.0


In [46]:
df3.loc[1:3, ['letters', 'numbers']]

Unnamed: 0,letters,numbers
1,a,0.0
2,a,0.0
3,d,33.0
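
One detail worth seeing side by side: a slice in `iloc` excludes its stop position (like a Python list), while a label slice in `loc` includes its stop label. A small sketch on a made-up frame:

```python
import pandas as pd

# Hypothetical frame with the default integer index 0..4
demo = pd.DataFrame({'letters': list('abcde'),
                     'numbers': [12, 20, 40, 33, 88]})

by_position = demo.iloc[1:3]  # positions 1 and 2; stop position excluded
by_label = demo.loc[1:3]      # labels 1, 2 and 3; stop label included

print(by_position)
print(by_label)
```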


We can also use the faster `iat` and `at` instead of the very similar `iloc` and `loc`. The downside to using `at` and `iat` however is that you can't use arrays as indices (you must use scalars) like you can using `loc` and `iloc`.

In [47]:
df3.iat[0, 1]

0.0

In [48]:
# WILL NOT WORK - can't use arrays as indices
# df3.iat[0, :]

In [49]:
df3.at[3, 'letters']

'd'

In [50]:
# If iat only takes indices, why does this work?
df3.iat[0, df3.columns.get_loc('letters')]

'a'
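
One way to answer the question above: `columns.get_loc` translates a label into its integer position, so `iat` still only ever receives integers. A minimal sketch with a made-up two-row frame:

```python
import pandas as pd

# Hypothetical frame used only for illustration
demo = pd.DataFrame({'letters': ['a', 'b'], 'numbers': [1.0, 2.0]})

# get_loc returns the integer position of the column label
position = demo.columns.get_loc('letters')

value_iat = demo.iat[0, position]   # positional scalar lookup
value_at = demo.at[0, 'letters']    # label-based scalar lookup

print(position, value_iat, value_at)
```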

## Handling Missing Data (replacing, dropping) and Duplicates

```python
df.drop()
df.drop_duplicates()

df.replace()
df.fillna()
```

As data scientists, we spend a lot of time cleaning data, so pandas has some really convenient built-in methods to help us with this.

In [51]:
s2

0    a
1    v
2    c
dtype: object

In [52]:
# Drop 1 on axis 0 (rows)
s = s2.drop(1, axis=0)
s

0    a
2    c
dtype: object

In [53]:
df6

Unnamed: 0,A,B,C,D
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11


In [54]:
# Drop 1 along axis 0 (rows)
df6.drop(1, axis=0)

Unnamed: 0,A,B,C,D
0,0,1,2,3
2,8,9,10,11


In [55]:
# Drop B and D along axis 1 (columns)
df6.drop(['B', 'D'], axis=1)

Unnamed: 0,A,C
0,0,2
1,4,6
2,8,10


We can see df3 has some duplicate rows. We can get rid of them

In [56]:
df3

Unnamed: 0,letters,numbers
0,a,0.0
1,a,0.0
2,a,0.0
3,d,33.0
4,s,100.0


In [57]:
df3.drop_duplicates()

Unnamed: 0,letters,numbers
0,a,0.0
3,d,33.0
4,s,100.0


We can also replace values, and fill in NaN values with 0 or some other value.

In [58]:
df3

Unnamed: 0,letters,numbers
0,a,0.0
1,a,0.0
2,a,0.0
3,d,33.0
4,s,100.0


In [59]:
df3.replace('a', 'd')

Unnamed: 0,letters,numbers
0,d,0.0
1,d,0.0
2,d,0.0
3,d,33.0
4,s,100.0


In [61]:
s3 = pd.Series([7, -2, 3], index=['a', 'c', 'd'])
s3

a    7
c   -2
d    3
dtype: int64

In [62]:
# Create some data with NaN vals
s = s1 + s3
s

0   NaN
1   NaN
2   NaN
3   NaN
a   NaN
c   NaN
d   NaN
dtype: float64

In [63]:
# Fill all NaN values with 0
s.fillna(0)

0    0.0
1    0.0
2    0.0
3    0.0
a    0.0
c    0.0
d    0.0
dtype: float64
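
Filling with 0 is only one option. The sketch below, on a made-up Series, counts the missing values with `isna`, fills them with the mean of the remaining values, and alternatively drops them with `dropna`.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with two missing values
scores = pd.Series([1.0, np.nan, 3.0, np.nan])

missing_count = scores.isna().sum()         # number of NaN entries
filled_mean = scores.fillna(scores.mean())  # mean() skips NaN by default
dropped = scores.dropna()                   # keep only the non-missing rows

print(missing_count)
print(filled_mean)
print(dropped)
```

Which option is appropriate depends on the data; filling with a constant or a mean changes the distribution, while dropping shrinks the dataset.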

## Applying Functions

```python
df.apply(my_funct)
df.applymap(my_funct)
```

In [64]:
df6

Unnamed: 0,A,B,C,D
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11


We can apply a function element-wise to a DataFrame. (In pandas 2.1 and later, `applymap` is deprecated in favor of the equivalent `DataFrame.map`.)

In [65]:
# Apply function element-wise
df6.applymap(lambda x: x*2)

Unnamed: 0,A,B,C,D
0,0,2,4,6
1,8,10,12,14
2,16,18,20,22


We can also apply a function to a specific subset of data in a DataFrame.

In [66]:
df6

Unnamed: 0,A,B,C,D
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11


In [67]:
# Apply function to each column (axis=0 runs down the rows)
df = df6.apply(np.sum, axis=0)
df

A    12
B    15
C    18
D    21
dtype: int64

In [68]:
# Apply function to each row (axis=1 runs across the columns)
df = df6.apply(np.sum, axis=1)
df

0     6
1    22
2    38
dtype: int64
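
`apply` is not limited to numpy functions. As a sketch, here is a hypothetical helper `value_range` applied along each axis of a frame built the same way as `df6`:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(np.arange(12).reshape(3, 4),
                    columns=['A', 'B', 'C', 'D'])


def value_range(values):
    """Return the spread (max minus min) of a Series."""
    return values.max() - values.min()


per_column = demo.apply(value_range, axis=0)  # one result per column
per_row = demo.apply(value_range, axis=1)     # one result per row

print(per_column)
print(per_row)
```

With `axis=0` the function receives each column as a Series; with `axis=1` it receives each row.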

## Sort and Rank

```python
df.sort_index()
df.sort_values()
df.rank()
```

In [69]:
df4

Unnamed: 0,Country,Capital,Population
0,Haiti,Port-au-Prince,100000
1,Canada,Montreal,200000
2,England,London,300000


In [70]:
# Sort by LABELS (along an axis).
df4.sort_index(axis=1)

Unnamed: 0,Capital,Country,Population
0,Port-au-Prince,Haiti,100000
1,Montreal,Canada,200000
2,London,England,300000


In [71]:
# Sort by VALUES (along an axis)
df4.sort_values(by='Country')

Unnamed: 0,Country,Capital,Population
1,Canada,Montreal,200000
2,England,London,300000
0,Haiti,Port-au-Prince,100000


In [72]:
df3

Unnamed: 0,letters,numbers
0,a,0.0
1,a,0.0
2,a,0.0
3,d,33.0
4,s,100.0


In [73]:
# Compute numerical data ranks (1 through n) along axis
df3.rank(axis=0, method='max')

Unnamed: 0,letters,numbers
0,3.0,3.0
1,3.0,3.0
2,3.0,3.0
3,4.0,4.0
4,5.0,5.0
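
The `method` argument decides how tied values are ranked. The sketch below compares four of the available methods on a made-up Series with the same values as the `numbers` column above.

```python
import pandas as pd

# Three tied values followed by two distinct ones
values = pd.Series([0.0, 0.0, 0.0, 33.0, 100.0])

ranks = pd.DataFrame({
    'average': values.rank(method='average'),  # ties share the mean rank
    'min': values.rank(method='min'),          # ties take the lowest rank
    'max': values.rank(method='max'),          # ties take the highest rank
    'dense': values.rank(method='dense'),      # like 'min', but with no gaps
})

print(ranks)
```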


## Aggregate Functions

```python
df.describe() # show all summary statistics

df.sum()
df.cumsum()

df.min()
df.max()

df.idxmin()
df.idxmax()

df.mean()
df.median()
```

Describe is one of my favorite methods because it gives you a lot of summary stats side by side without you having to do much.

In [74]:
# Show summary stats for the diamonds df dataset
diamonds_df.describe()

Unnamed: 0,carat,depth,table,price,x,y,z
count,53940.0,53940.0,53940.0,53940.0,53940.0,53940.0,53940.0
mean,0.79794,61.749405,57.457184,3932.799722,5.731157,5.734526,3.538734
std,0.474011,1.432621,2.234491,3989.439738,1.121761,1.142135,0.705699
min,0.2,43.0,43.0,326.0,0.0,0.0,0.0
25%,0.4,61.0,56.0,950.0,4.71,4.72,2.91
50%,0.7,61.8,57.0,2401.0,5.7,5.71,3.53
75%,1.04,62.5,59.0,5324.25,6.54,6.54,4.04
max,5.01,79.0,95.0,18823.0,10.74,58.9,31.8


We can also compute each of these individually or using specific criteria. Let's look at a smaller df that we can compute in our heads so we can see exactly what the functions are doing.

In [75]:
df6

Unnamed: 0,A,B,C,D
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11


In [76]:
# Sum along B
df6['B'].sum()

15

In [77]:
df6.cumsum(axis='index')

Unnamed: 0,A,B,C,D
0,0,1,2,3
1,4,6,8,10
2,12,15,18,21


In [78]:
df6.cumsum(axis='columns')

Unnamed: 0,A,B,C,D
0,0,1,3,6
1,4,9,15,22
2,8,17,27,38


In [79]:
# Min along axis=0 (the default axis is 0, i.e. 'index')
df6.min()

A    0
B    1
C    2
D    3
dtype: int64

In [80]:
df6.max(axis=1)

0     3
1     7
2    11
dtype: int64

In [81]:
# Index label of the minimum value in each column
df6.idxmin()

A    0
B    0
C    0
D    0
dtype: int64

In [82]:
# Index label of the maximum value in each column
df6.idxmax()

A    2
B    2
C    2
D    2
dtype: int64

In [83]:
# Mean along axis
df6.mean()

A    4.0
B    5.0
C    6.0
D    7.0
dtype: float64

In [84]:
# Median along axis
df6.median()

A    4.0
B    5.0
C    6.0
D    7.0
dtype: float64

## Arithmetic Operations with Fill Methods

```python
s.add()
s.sub()
s.div()
s.mul()
```

In [85]:
s1 = pd.Series([7, -2, 3], index=['a', 'b', 'c'])
s1

a    7
b   -2
c    3
dtype: int64

In [86]:
s3 = pd.Series([7, -2, 3], index=['a', 'c', 'd'])
s3

a    7
c   -2
d    3
dtype: int64

We can add these two Series.

In [87]:
s1 + s3

a    14.0
b     NaN
c     1.0
d     NaN
dtype: float64

You can see it will add where the indices line up. We can also add with a fill value so we don't get NaNs where indices don't match.

In [88]:
s1.add(s3, fill_value=0)

a    14.0
b    -2.0
c     1.0
d     3.0
dtype: float64

We can do the same with subtraction, division, etc.

In [89]:
s1.sub(s3, fill_value=2)

a    0.0
b   -4.0
c    5.0
d   -1.0
dtype: float64

In [90]:
s1.div(s3, fill_value=4)

a    1.000000
b   -0.500000
c   -1.500000
d    1.333333
dtype: float64

In [91]:
s1.mul(s3, fill_value=3)

a    49.0
b    -6.0
c    -6.0
d     9.0
dtype: float64

# Exercises

We all know we don't really learn anything until we have to struggle through doing it :D 

Roll up your sleeves and dive in.

## 1. Create a DataFrame that looks like this

```python
    Animal  Number_legs
0      cat          4.0
1  penguin          2.0
2      dog          4.0
3   spider          8.0
4    snake          NaN
```

In [93]:
# TRY IT HERE

## 2. Create a dataframe by first creating a series for each column

```python
     A    B
0  2.0  1.0
1  3.0  NaN
2  1.0  0.0
```

In [107]:
# TRY IT HERE

## 3. Inspect the following dataset

* Look at 10 random sample rows
* Look at the beginning/end of the dataset
* Calculate the std, mean, etc.

In [3]:
titanic_df = sns.load_dataset('titanic')

In [120]:
# TRY IT HERE

## 4. Drop NA values in the following dataset (age column)

In [6]:
titanic_df.sample(20)

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
44,1,3,female,19.0,0,0,7.8792,Q,Third,woman,False,,Queenstown,yes,True
855,1,3,female,18.0,0,1,9.35,S,Third,woman,False,,Southampton,yes,False
817,0,2,male,31.0,1,1,37.0042,C,Second,man,True,,Cherbourg,no,False
818,0,3,male,43.0,0,0,6.45,S,Third,man,True,,Southampton,no,True
502,0,3,female,,0,0,7.6292,Q,Third,woman,False,,Queenstown,no,True
62,0,1,male,45.0,1,0,83.475,S,First,man,True,C,Southampton,no,False
355,0,3,male,28.0,0,0,9.5,S,Third,man,True,,Southampton,no,True
755,1,2,male,0.67,1,1,14.5,S,Second,child,False,,Southampton,yes,False
884,0,3,male,25.0,0,0,7.05,S,Third,man,True,,Southampton,no,True
822,0,1,male,38.0,0,0,0.0,S,First,man,True,,Southampton,no,True


In [127]:
# TRY IT HERE

## 5. Create a subset of the data - lower-class males under 30

In [25]:
titanic_df.sample(5)

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
694,0,1,male,60.0,0,0,26.55,S,First,man,True,,Southampton,no,True
205,0,3,female,2.0,0,1,10.4625,S,Third,child,False,G,Southampton,no,False
529,0,2,male,23.0,2,1,11.5,S,Second,man,True,,Southampton,no,False
111,0,3,female,14.5,1,0,14.4542,C,Third,child,False,,Cherbourg,no,False
141,1,3,female,22.0,0,0,7.75,S,Third,woman,False,,Southampton,yes,True


In [None]:
# TRY IT HERE

## 6. Find the average depth for each color of diamond

In [72]:
diamonds_df = sns.load_dataset('diamonds')
diamonds_df.sample(5)

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
8709,0.39,Ideal,J,VS2,61.8,55.9,586,4.66,4.69,2.89
6472,1.03,Ideal,J,SI2,62.4,56.0,4054,6.44,6.48,4.03
40802,0.51,Ideal,F,SI2,60.5,55.0,1169,5.17,5.21,3.14
29439,0.32,Very Good,F,VVS2,63.8,55.0,701,4.33,4.39,2.78
1618,0.81,Ideal,E,SI2,61.8,56.0,3013,6.0,5.97,3.7


In [None]:
# TRY IT HERE

## 7. Map all of the depth values in the dataset to values between 0 and 1

In [73]:
diamonds_df.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63
4,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75


In [None]:
# TRY IT HERE

## 8. Create a new column of the dataframe that is the sum of x, y, and z

In [79]:
diamonds_df.head()

Unnamed: 0,carat,cut,color,clarity,depth,depth2,table,price,x,y,z
0,0.23,Ideal,E,SI2,61.5,0.621212,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,0.60404,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,0.574747,65.0,327,4.05,4.07,2.31
3,0.29,Premium,I,VS2,62.4,0.630303,58.0,334,4.2,4.23,2.63
4,0.31,Good,J,SI2,63.3,0.639394,58.0,335,4.34,4.35,2.75


In [None]:
# TRY IT HERE

## 9. Multiply the series

In [127]:
s1 = pd.Series([7, -2, 3])
s2 = pd.Series([1, 10, 100])

In [None]:
# TRY IT HERE

## 10. Change the dataframe indices to letters

In [102]:
df = pd.DataFrame([1, 2, 3, 4, 5])
df

Unnamed: 0,0
0,1
1,2
2,3
3,4
4,5


In [100]:
# TRY IT HERE

# Appendix

## Answers to Exercises

### 1

```python
    Animal  Number_legs
0      cat          4.0
1  penguin          2.0
2      dog          4.0
3   spider          8.0
4    snake          NaN
```

It looks like we will have numerical indices which will be assigned automatically but we will need to label the dataframe columns manually.

In [101]:
d = {'Animal': ['cat', 'penguin', 'dog', 'spider', 'snake'],
     'Number_legs': [4, 2, 4, 8, np.nan]}
df = pd.DataFrame(data=d)
df

Unnamed: 0,Animal,Number_legs
0,cat,4.0
1,penguin,2.0
2,dog,4.0
3,spider,8.0
4,snake,


Instead of a dictionary, we can also use a list of rows (a Python list of lists) and pass the column labels via `columns`.

In [106]:
data = [['cat', 4],
        ['penguin', 2],
        ['dog', 4],
        ['spider', 8],
        ['snake', np.nan]]

df = pd.DataFrame(data, columns=['Animal', 'Number_legs'])
df

Unnamed: 0,Animal,Number_legs
0,cat,4.0
1,penguin,2.0
2,dog,4.0
3,spider,8.0
4,snake,


### 2

```python
     A    B
0  2.0  1.0
1  3.0  NaN
2  1.0  0.0
```

In [114]:
s1 = pd.Series([2, 3, 1])
s1

0    2
1    3
2    1
dtype: int64

In [115]:
s2 = pd.Series([1, np.nan, 0])
s2

0    1.0
1    NaN
2    0.0
dtype: float64

In [119]:
df = pd.concat([s1, s2], axis=1)
df

Unnamed: 0,0,1
0,2,1.0
1,3,
2,1,0.0
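
The result above has columns labelled 0 and 1, and an integer first column, whereas the target shows columns `A` and `B` holding floats. One possible way to match it, sketched here with the hypothetical names `col_a` and `col_b`, is to build the Series from floats and pass them in a dictionary keyed by column name.

```python
import numpy as np
import pandas as pd

# One Series per column; float values so both columns print as 2.0, 1.0, ...
col_a = pd.Series([2.0, 3.0, 1.0])
col_b = pd.Series([1.0, np.nan, 0.0])

# Dictionary keys become the column labels
named = pd.DataFrame({'A': col_a, 'B': col_b})

print(named)
```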


### 3

Titanic Dataset
* Look at 10 random sample rows
* Look at the beginning/end of the dataset
* Calculate the std, mean, etc.

In [123]:
titanic_df.sample(10)

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
632,1,1,male,32.0,0,0,30.5,C,First,man,True,B,Cherbourg,yes,True
872,0,1,male,33.0,0,0,5.0,S,First,man,True,B,Southampton,no,True
12,0,3,male,20.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
811,0,3,male,39.0,0,0,24.15,S,Third,man,True,,Southampton,no,True
450,0,2,male,36.0,1,2,27.75,S,Second,man,True,,Southampton,no,False
181,0,2,male,,0,0,15.05,C,Second,man,True,,Cherbourg,no,True
47,1,3,female,,0,0,7.75,Q,Third,woman,False,,Queenstown,yes,True
817,0,2,male,31.0,1,1,37.0042,C,Second,man,True,,Cherbourg,no,False
354,0,3,male,,0,0,7.225,C,Third,man,True,,Cherbourg,no,True
643,1,3,male,,0,0,56.4958,S,Third,man,True,,Southampton,yes,True


In [124]:
titanic_df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [125]:
titanic_df.tail()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
886,0,2,male,27.0,0,0,13.0,S,Second,man,True,,Southampton,no,True
887,1,1,female,19.0,0,0,30.0,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,,1,2,23.45,S,Third,woman,False,,Southampton,no,False
889,1,1,male,26.0,0,0,30.0,C,First,man,True,C,Cherbourg,yes,True
890,0,3,male,32.0,0,0,7.75,Q,Third,man,True,,Queenstown,no,True


In [126]:
titanic_df.describe()

Unnamed: 0,survived,pclass,age,sibsp,parch,fare
count,891.0,891.0,714.0,891.0,891.0,891.0
mean,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,0.0,1.0,0.42,0.0,0.0,0.0
25%,0.0,2.0,20.125,0.0,0.0,7.9104
50%,0.0,3.0,28.0,0.0,0.0,14.4542
75%,1.0,3.0,38.0,1.0,0.0,31.0
max,1.0,3.0,80.0,8.0,6.0,512.3292


### 4

We can remove the rows that contain NA in the age column using the pandas `dropna` method.

In [16]:
# Check how many rows are initially there
titanic_df.shape

(891, 15)

In [21]:
titanic_df[~titanic_df['age'].isna()]

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
885,0,3,female,39.0,0,5,29.1250,Q,Third,woman,False,,Queenstown,no,False
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True


Looks like we have 714 rows that are not NA in the age column. That should be the length of our new df when we are done.

In [22]:
dropped = titanic_df.dropna(subset=['age'])

Then we can check and make sure we got all of them

In [23]:
# Check how many rows in the new df
dropped.shape

(714, 15)

### 5

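Use boolean-mask subsetting to select the passengers. The mask in the cell below checks only sex and age, so its result still includes first- and second-class passengers; adding a condition on `pclass` would restrict it to the lowest class.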

In [26]:
titanic_df.sample(5)

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
583,0,1,male,36.0,0,0,40.125,C,First,man,True,A,Cherbourg,no,True
733,0,2,male,23.0,0,0,13.0,S,Second,man,True,,Southampton,no,True
535,1,2,female,7.0,0,2,26.25,S,Second,child,False,,Southampton,yes,False
662,0,1,male,47.0,0,0,25.5875,S,First,man,True,E,Southampton,no,True
825,0,3,male,,0,0,6.95,Q,Third,man,True,,Queenstown,no,True


In [36]:
df = titanic_df[(titanic_df['age'] < 30) & (titanic_df['sex'] == 'male')]
df.sample(30)

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
86,0,3,male,16.0,1,3,34.375,S,Third,man,True,,Southampton,no,False
131,0,3,male,20.0,0,0,7.05,S,Third,man,True,,Southampton,no,True
372,0,3,male,19.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
321,0,3,male,27.0,0,0,7.8958,S,Third,man,True,,Southampton,no,True
242,0,2,male,29.0,0,0,10.5,S,Second,man,True,,Southampton,no,True
34,0,1,male,28.0,1,0,82.1708,C,First,man,True,,Cherbourg,no,False
683,0,3,male,14.0,5,2,46.9,S,Third,child,False,,Southampton,no,False
693,0,3,male,25.0,0,0,7.225,C,Third,man,True,,Cherbourg,no,True
762,1,3,male,20.0,0,0,7.2292,C,Third,man,True,,Cherbourg,yes,True
336,0,1,male,29.0,1,0,66.6,S,First,man,True,C,Southampton,no,False
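
A sketch of the full three-condition mask, using a small made-up frame (`passengers`) in place of the Titanic data and assuming that "lower class" means `pclass == 3`:

```python
import pandas as pd

# Hypothetical stand-in for a few Titanic rows
passengers = pd.DataFrame({
    'pclass': [3, 1, 3, 2, 3],
    'sex': ['male', 'male', 'female', 'male', 'male'],
    'age': [22.0, 28.0, 19.0, 25.0, 41.0],
})

# Each condition needs its own parentheses when combined with &
mask = ((passengers['pclass'] == 3)
        & (passengers['sex'] == 'male')
        & (passengers['age'] < 30))

subset = passengers[mask]
print(subset)
```

Rows with a missing age compare as False against `< 30`, so they are left out of the subset automatically.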


### 6

We can use the `groupby` function and the aggregate `mean` function to get the average depth for each color.

We can start by reminding ourselves what the data looks like.

In [39]:
diamonds_df.sample(5)

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
47906,0.53,Ideal,G,VVS2,62.5,54.0,1914,5.21,5.19,3.25
43888,0.52,Ideal,I,VVS1,60.1,56.0,1454,5.24,5.27,3.16
18108,1.02,Ideal,F,VS2,62.0,56.0,7326,6.42,6.45,3.99
23127,1.51,Premium,G,SI1,62.6,58.0,11153,7.27,7.3,4.56
5418,1.05,Premium,H,SI2,63.0,57.0,3822,6.47,6.42,4.06


Then use the groupby function followed by the aggregate mean.

In [42]:
diamonds_df.groupby('color').mean(numeric_only=True)

Unnamed: 0_level_0,carat,depth,table,price,x,y,z
color,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
D,0.657795,61.698125,57.40459,3169.954096,5.417051,5.421128,3.342827
E,0.657867,61.66209,57.491201,3076.752475,5.41158,5.419029,3.340689
F,0.736538,61.694582,57.433536,3724.886397,5.614961,5.619456,3.464446
G,0.77119,61.757111,57.288629,3999.135671,5.677543,5.680192,3.505021
H,0.911799,61.83685,57.517811,4486.669196,5.983335,5.984815,3.695965
I,1.026927,61.846385,57.577278,5091.874954,6.222826,6.22273,3.845411
J,1.162137,61.887215,57.812393,5323.81802,6.519338,6.518105,4.033251
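
Since the exercise only asks for depth, we can also select that column before aggregating. A minimal sketch on a made-up frame (`stones`) standing in for the diamonds data:

```python
import pandas as pd

# Hypothetical stand-in for a few diamonds rows
stones = pd.DataFrame({
    'color': ['D', 'E', 'D', 'E'],
    'cut': ['Ideal', 'Good', 'Premium', 'Ideal'],
    'depth': [60.0, 62.0, 62.0, 64.0],
})

# Select the column of interest, then aggregate per group
mean_depth = stones.groupby('color')['depth'].mean()

print(mean_depth)
```

Selecting the column first returns a Series indexed by color and sidesteps the non-numeric columns entirely.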


### 7

We can use the `apply` function to transform each value with a function of our choosing.

In [74]:
diamonds_df.sample(10)

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
25185,1.51,Ideal,F,VS2,62.2,53.0,13771,7.35,7.31,4.56
38528,0.34,Ideal,E,VS1,61.2,57.0,1033,4.5,4.45,2.74
2287,0.9,Very Good,H,SI2,59.5,63.0,3160,6.26,6.31,3.74
53897,1.02,Good,H,I1,64.3,63.0,2751,6.28,6.23,4.02
22289,1.42,Premium,F,VS1,58.4,59.0,10338,7.36,7.32,4.29
34191,0.33,Ideal,F,VS2,61.6,56.0,854,4.46,4.44,2.74
47976,0.29,Ideal,G,VS1,61.9,55.0,532,4.25,4.28,2.64
41450,0.4,Ideal,G,IF,61.2,56.0,1229,4.76,4.81,2.93
35696,0.3,Premium,E,VS1,61.8,58.0,911,4.33,4.31,2.67
37659,0.42,Premium,F,SI1,62.9,57.0,992,4.77,4.73,2.99


First let's look at the range of values we are working with in the depth column.

In [75]:
diamonds_df.describe()['depth']

count    53940.000000
mean        61.749405
std          1.432621
min         43.000000
25%         61.000000
50%         61.800000
75%         62.500000
max         79.000000
Name: depth, dtype: float64

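Then we can use the apply function to map the values to a new range. Here we divide by 99; any constant larger than the maximum depth of 79 keeps every result between 0 and 1.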

In [76]:
df = diamonds_df.apply(lambda x: x['depth']/99, axis=1)
df

0        0.621212
1        0.604040
2        0.574747
3        0.630303
4        0.639394
           ...   
53935    0.614141
53936    0.637374
53937    0.634343
53938    0.616162
53939    0.628283
Length: 53940, dtype: float64

We could also add our mapped value as another column in the dataframe if we'd like.

In [77]:
diamonds_df.insert(5, 'depth2', df)
diamonds_df.sample(10)

Unnamed: 0,carat,cut,color,clarity,depth,depth2,table,price,x,y,z
45646,0.25,Ideal,H,VVS1,60.2,0.608081,56.0,525,4.1,4.11,2.47
16695,0.32,Ideal,G,SI2,62.5,0.631313,55.0,421,4.37,4.4,2.74
50975,0.31,Premium,G,VS2,62.4,0.630303,59.0,544,4.34,4.38,2.72
50477,0.7,Very Good,F,SI1,62.3,0.629293,57.0,2267,5.62,5.65,3.51
26676,0.27,Ideal,G,SI1,62.2,0.628283,54.0,426,4.13,4.17,2.58
7074,0.33,Premium,G,VS2,60.8,0.614141,58.0,579,4.45,4.47,2.71
28266,0.28,Very Good,H,SI1,61.5,0.621212,56.0,360,4.21,4.24,2.6
53162,0.72,Ideal,E,SI1,62.0,0.626263,57.0,2626,5.7,5.75,3.55
28946,0.3,Ideal,G,VVS2,60.6,0.612121,57.0,684,4.33,4.35,2.63
19077,1.22,Premium,G,VS1,61.6,0.622222,60.0,7850,6.85,6.88,4.23
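
Dividing by a constant is one choice. A common alternative, swapped in here as an illustration rather than taken from the workbook, is min-max scaling, which maps the smallest value to exactly 0 and the largest to exactly 1. It can be written with vectorized arithmetic instead of a row-wise `apply`:

```python
import pandas as pd

# Hypothetical depth values spanning the min and max reported by describe()
depth = pd.Series([43.0, 61.0, 79.0])

# Min-max scaling: (value - min) / (max - min)
scaled = (depth - depth.min()) / (depth.max() - depth.min())

print(scaled)
```

Vectorized expressions like this operate on the whole column at once, which is typically much faster than calling a Python function per row.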


### 8

There are many ways to solve this problem, but we could very easily add a new column using the pandas `assign` method, which returns a new DataFrame and leaves the original unchanged.

In [98]:
diamonds_df.head()

Unnamed: 0,carat,cut,color,clarity,depth,depth2,table,price,x,y,z
0,0.23,Ideal,E,SI2,61.5,0.621212,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,0.60404,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,0.574747,65.0,327,4.05,4.07,2.31
3,0.29,Premium,I,VS2,62.4,0.630303,58.0,334,4.2,4.23,2.63
4,0.31,Good,J,SI2,63.3,0.639394,58.0,335,4.34,4.35,2.75


In [99]:
diamonds_df.assign(added=lambda i: i.x + i.y + i.z)

Unnamed: 0,carat,cut,color,clarity,depth,depth2,table,price,x,y,z,added
0,0.23,Ideal,E,SI2,61.5,0.621212,55.0,326,3.95,3.98,2.43,10.36
1,0.21,Premium,E,SI1,59.8,0.604040,61.0,326,3.89,3.84,2.31,10.04
2,0.23,Good,E,VS1,56.9,0.574747,65.0,327,4.05,4.07,2.31,10.43
3,0.29,Premium,I,VS2,62.4,0.630303,58.0,334,4.20,4.23,2.63,11.06
4,0.31,Good,J,SI2,63.3,0.639394,58.0,335,4.34,4.35,2.75,11.44
...,...,...,...,...,...,...,...,...,...,...,...,...
53935,0.72,Ideal,D,SI1,60.8,0.614141,57.0,2757,5.75,5.76,3.50,15.01
53936,0.72,Good,D,SI1,63.1,0.637374,55.0,2757,5.69,5.75,3.61,15.05
53937,0.70,Very Good,D,SI1,62.8,0.634343,60.0,2757,5.66,5.68,3.56,14.90
53938,0.86,Premium,H,SI2,61.0,0.616162,58.0,2757,6.15,6.12,3.74,16.01
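
Because `assign` returns a new DataFrame, the result has to be stored if we want to keep it. The sketch below, on a made-up frame, contrasts that with assigning a column in place.

```python
import pandas as pd

# Hypothetical stand-in for the x, y, z columns
demo = pd.DataFrame({'x': [1.0, 2.0], 'y': [3.0, 4.0], 'z': [5.0, 6.0]})

# assign returns a copy with the extra column; demo itself is untouched
with_total = demo.assign(added=demo['x'] + demo['y'] + demo['z'])
original_unchanged = 'added' not in demo.columns

# Bracket assignment modifies demo in place
demo['added'] = demo[['x', 'y', 'z']].sum(axis=1)

print(with_total)
print(demo)
```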


### 9

In [128]:
s1 = pd.Series([7, -2, 3])
s2 = pd.Series([1, 10, 100])

In [129]:
s1.multiply(s2)

0      7
1    -20
2    300
dtype: int64

### 10

In [120]:
df = pd.DataFrame([1, 2, 3, 4, 5])
df

Unnamed: 0,0
0,1
1,2
2,3
3,4
4,5


We could do it like this or using the `set_index` function

In [121]:
df.index = ['a', 'b', 'c', 'd', 'e']
df

Unnamed: 0,0
a,1
b,2
c,3
d,4
e,5
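
For comparison, a sketch of the `set_index` route: it promotes an existing column to the index, so the letters first need to live in a column. The column names `value` and `letter` are made up for this example.

```python
import pandas as pd

demo = pd.DataFrame({'value': [1, 2, 3, 4, 5]})

# Store the letters in a column, then promote that column to the index
demo['letter'] = list('abcde')
indexed = demo.set_index('letter')

print(indexed)
```

`set_index` returns a new DataFrame, so `demo` keeps its original integer index.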


## Troubleshooting Tips

Having trouble running the notebook?

If you run into issues running any of the code in this notebook, check your version of Jupyter and IPython as well as any extensions, libraries, etc.

```bash
!jupyter --version

jupyter core     : 4.6.1
jupyter-notebook : 6.0.2
qtconsole        : not installed
ipython          : 7.9.0
ipykernel        : 5.1.3
jupyter client   : 5.3.4
jupyter lab      : 1.2.3
nbconvert        : 5.6.1
ipywidgets       : not installed
nbformat         : 4.4.0
traitlets        : 4.3.3
```

In [6]:
# # Run this cell to check the version of Jupyter you are running
# !jupyter --version

In [2]:
# # Run one of these cells to check what extensions you are using
# !jupyter-labextension list
# !jupyter-nbextension list

In [1]:
# # Check ipython version
# import sys
# print(sys.version)

If you are still having issues, try restarting your kernel and/or reloading the notebook completely.