# Pandas lecture 1

In this class, we start our deep dive into the `pandas` library. This lecture lays the foundation for how the `pandas` data types work. Most sections contain a lot of detail. It is important that you understand each detail, but you don't need to memorize them all; refer back to this lecture as you need them.

`pandas` is a huge library, and there are multiple ways of doing everything. We are only going to learn a few of those ways.

In [2]:
import pandas as pd

## Introducing pandas data types

### Series

- `Series` is a datatype in `pandas` for storing an array of data. It is similar to a `list` in that the elements of a `Series` have an order.
- In addition, it has an index, i.e., every element in the `Series` is mapped to a _key_. This is similar to a `dict`, though unlike a dictionary, the elements of a `Series` can also be accessed by their position.
- The index-to-value mapping never changes by itself; it changes only if you manually change the index. Thus, you can use the index to define an order.

#### Creating a series

Here are some ways to create an object of type `Series`.

In [7]:
series_simple = pd.Series([1, 2, 3, 4, 5])
series_simple

0    1
1    2
2    3
3    4
4    5
dtype: int64

- Note that two columns are printed. The left one is the index column, and the right one is the data column.
- The `Series` type also keeps track of the data type of its values, as you can see in the last line of the output above. The `dtype` can be specified when creating the `Series`, or you can let `pandas` infer it automatically.
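As a small illustration of the `dtype` point above, the sketch below creates the same data twice, once letting `pandas` infer the type and once passing `dtype` explicitly. The variable names are made up for this example.

```python
import pandas as pd

# pandas infers an integer dtype from the data
series_inferred = pd.Series([1, 2, 3])

# dtype is given explicitly, so the values are stored as floats
series_float = pd.Series([1, 2, 3], dtype='float64')

print(series_inferred.dtype)
print(series_float.dtype)
```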

In [9]:
series_custom_index = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
series_custom_index

a    1
b    2
c    3
d    4
e    5
dtype: int64

In [11]:
data_dict = {'c': 1, 'b': 2, 'a': 3}
series_from_dict = pd.Series(data_dict)
series_from_dict

c    1
b    2
a    3
dtype: int64

In [14]:
series_scalar = pd.Series(7)
series_scalar

0    7
dtype: int64

In [19]:
series_repeated_element = pd.Series(7, range(10,20))
series_repeated_element

10    7
11    7
12    7
13    7
14    7
15    7
16    7
17    7
18    7
19    7
dtype: int64

In [25]:
series_repeated_element.get(10)

7

#### Accessing elements from Series

There are many ways to access elements from a series. Let's create a simple series to see how it works.

In [33]:
series_simple = pd.Series(range(10, 20))
series_simple

0    10
1    11
2    12
3    13
4    14
5    15
6    16
7    17
8    18
9    19
dtype: int64

You can use the indexing operator (`[]`), like in a list, or the `get()` method, like in a `dict`, to access single values.

In [34]:
print("indexing operator:", series_simple[0])
print("get() method:", series_simple.get(0))

indexing operator: 10
get() method: 10


The `dict` analogy for indices as keys continues here - you can use the index value as a key to get the corresponding data value.

In [48]:
print(series_from_dict['a'])
print(series_from_dict.get('a'))

3
3


This can also cause confusion: 0 could be an index key, or it could mean the first element in the `Series`. If the index consists of integers, both the `[]` operator and the `get()` method treat the value as an index key, not as a position. The example below makes this clear.

In [47]:
# raises a KeyError because the index labels of this Series start from 10.
# series_repeated_element[0]

# works as expected, if you think of index value as the key
series_repeated_element[10]

7
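If you want to be explicit about which meaning you intend, `Series` also provides the `loc[]` (label-based) and `iloc[]` (position-based) indexers, which are covered for `DataFrame` later in this lecture. A minimal sketch, re-creating the series from above:

```python
import pandas as pd

series_repeated_element = pd.Series(7, index=range(10, 20))

# loc[] always looks up by index label
by_label = series_repeated_element.loc[10]

# iloc[] always looks up by position, so 0 means the first element
by_position = series_repeated_element.iloc[0]

# get() returns None (or a default you pass) when the label is missing,
# instead of raising a KeyError like [] does
missing = series_repeated_element.get(0)
missing_with_default = series_repeated_element.get(0, -1)

print(by_label, by_position, missing, missing_with_default)
```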

Slices work like they do in a list. With an integer slice there is no confusion with the index values: the slice refers to the positions of the elements in the `Series`, not to the values of the index.

In [49]:
series_simple[10:15]

Series([], dtype: int64)
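The empty result above is what a positional slice gives: positions 10 to 14 don't exist in a series of 10 elements. As a rough sketch of the contrast, `iloc[]` slices by position explicitly, while `loc[]` slices by index label and includes the end label.

```python
import pandas as pd

series_simple = pd.Series(range(10, 20))

# positions 2, 3 and 4 (the end position is excluded, like in a list)
by_position = series_simple.iloc[2:5]

# labels 2, 3, 4 and 5 (with loc[], the end label is included)
by_label = series_simple.loc[2:5]

print(by_position.tolist())
print(by_label.tolist())
```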

There are many custom ways to access values from a Series, which are not present in Python `list` or `dict` types. Here are some examples:

In [39]:
select_values = series_simple[[1,3,6]]
select_values

1    11
3    13
6    16
dtype: int64

In [52]:
filtered_values = series_simple[series_simple > 15]
filtered_values

6    16
7    17
8    18
9    19
dtype: int64

We will spend more time on the expression language for filtering later.

#### Operators on Series objects

You can run operators on the series objects to do manipulations on the data.

In [66]:
scalar_multiply = series_simple * 2
scalar_multiply

0    20
1    22
2    24
3    26
4    28
5    30
6    32
7    34
8    36
9    38
dtype: int64

In [67]:
series2 = pd.Series(range(20, 30))
series2

0    20
1    21
2    22
3    23
4    24
5    25
6    26
7    27
8    28
9    29
dtype: int64

In [80]:
sum_two_series = series_simple + series2
sum_two_series

0    30
1    32
2    34
3    36
4    38
5    40
6    42
7    44
8    46
9    48
dtype: int64

When you sum two series (or use any other operator between two series), the summing happens element by element by matching the index. In the case above, the output for index 0 is the sum of `series_simple[0]` and `series2[0]`.
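To see that the matching is done on the index label and not on the position, here is a small sketch with two series (hypothetical names and values) that have the same labels in a different order.

```python
import pandas as pd

left = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
right = pd.Series([10, 20, 30], index=['c', 'b', 'a'])

# 'a' is matched with 'a', 'b' with 'b', and so on,
# regardless of where each label sits in its series
total = left + right
print(total)
```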

What happens if an index value is present in only one of the two series?

In [63]:
series2 = pd.Series(range(20,30), index=range(10,20))
series_simple + series2

0    NaN
1    NaN
2    NaN
3    NaN
4    NaN
5    NaN
6    NaN
7    NaN
8    NaN
9    NaN
10   NaN
11   NaN
12   NaN
13   NaN
14   NaN
15   NaN
16   NaN
17   NaN
18   NaN
19   NaN
dtype: float64

Can you explain why we see this output?
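If you would rather treat a missing value as 0 than get `NaN`, the `add()` method of `Series` accepts a `fill_value` argument. A minimal sketch with two small made-up series whose indices only partly overlap:

```python
import pandas as pd

first = pd.Series([1, 2, 3], index=[0, 1, 2])
second = pd.Series([10, 20, 30], index=[2, 3, 4])

# the + operator gives NaN wherever a label exists in only one series
plain_sum = first + second

# add() with fill_value=0 substitutes 0 for the missing side
filled_sum = first.add(second, fill_value=0)

print(plain_sum)
print(filled_sum)
```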

**Exercise:** Can you write a program to create 12 `Series` objects, where each of the `Series` objects contains multiples of 1, 2, 3, 4,...,12 respectively?

### DataFrame

- `DataFrame` is the most commonly used datatype in pandas. 
- An object of type `DataFrame` represents a _table_ of data, similar to how we talked about tables while working in SQL. But, it extends the concept of table in useful ways.
- Internally, `DataFrame` can be thought of as a mapping from column names to `Series` objects, all of which use the same index.
- That index is the index for the whole `DataFrame` and is used to identify the rows in the table.

In the last lecture, we created a `DataFrame` object by reading a csv file. Let's do that again and explore the structure of `DataFrame` a bit.

In [41]:
df_top_songs = pd.read_csv('top_100_songs.csv')
df_top_songs.head(5)

Unnamed: 0,year,position,artist,song,score,us,uk,de,fr,ca,au
0,2000,1,Faith Hill,Breathe,24030.051,2,33,-,-,-,23
1,2000,2,Joe Thomas,I Wanna Know,21516.777,4,-,-,61,-,34
2,2000,3,Santana & The Product G,Maria Maria,20941.78,1,6,-,1,-,49
3,2000,4,Vertical Horizon,Everything You Want,20402.965,1,42,-,-,-,24
4,2000,5,Toni Braxton,He Wasn't Man Enough,20068.614,2,5,-,14,-,5


In [42]:
type(df_top_songs['year'])

pandas.core.series.Series

In [43]:
id(df_top_songs.index)

140555678306992

In [44]:
id(df_top_songs['year'].index)

140555678306992

- You can select any column of a `DataFrame` using `dict`-like syntax.
- As you can see, a column of a `DataFrame` is a `Series` object.
- If we check the memory location of the `DataFrame` object's index and of the index of the `Series` object (a column of the `DataFrame`), they are identical. Thus, they are using the same index instance.

#### Creating a DataFrame

Like series, there are many ways to create a `DataFrame`. Since we know how to create a `Series`, let's first learn how to create a `DataFrame` using a number of `Series` objects.

In [85]:
series_dict = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
               'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df_from_dict_series = pd.DataFrame(series_dict)
df_from_dict_series

Unnamed: 0,one,two
a,1.0,1
b,2.0,2
c,3.0,3
d,,4


- To create a `DataFrame` from many `Series` objects, you can create a `dict` where the keys are the names of the columns, and the `Series` objects contain the column data.
- You can simply pass that `dict` object as an argument to the constructor of `DataFrame`, and you will get a `DataFrame` object.
- If your individual series have non-identical indices, `pandas` will take a union of them, and fill empty slots with `NaN`.

In [87]:
list_dict = {'one': [1, 2, 3, 5],
             'two': [1, 2, 3, 4]}

df_from_dict_list = pd.DataFrame(list_dict)
df_from_dict_list

Unnamed: 0,one,two
0,1,1
1,2,2
2,3,3
3,5,4


- You can simply have lists in the source dictionary, instead of `Series` objects.
- In this case, all the lists have to be of the same length, otherwise `pandas` will raise an error.
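A quick way to see the error for yourself is the sketch below, which passes lists of different lengths (made-up data) and catches the resulting `ValueError`.

```python
import pandas as pd

uneven_lists = {'one': [1, 2, 3],
                'two': [1, 2, 3, 4]}

try:
    pd.DataFrame(uneven_lists)
    raised = False
except ValueError as error:
    # pandas refuses to guess how to line up lists of different lengths
    raised = True
    print("ValueError:", error)
```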

You can create a custom index too:

In [88]:
df_custom_index = pd.DataFrame(list_dict, index=['A', 'B', 'C', 'D'])
df_custom_index

Unnamed: 0,one,two
A,1,1
B,2,2
C,3,3
D,5,4


We will check out one more way to create a `DataFrame` - using a list of tuples.

In [89]:
tuples = [('Arya', 'Female'), ('Jon Snow', 'Male'), ('Varys', "Male?")]
df_tuples = pd.DataFrame(tuples, columns=['Name', 'Gender'])
df_tuples

Unnamed: 0,Name,Gender
0,Arya,Female
1,Jon Snow,Male
2,Varys,Male?


As you can see, each tuple is treated as one row of data.

We can create a `DataFrame` from an existing dataframe, and select the columns and rows we need.

In [96]:
df_select_index = pd.DataFrame(df_top_songs, index=range(100))
df_select_index.tail(5)

Unnamed: 0,year,position,artist,song,score,us,uk,de,fr,ca,au
95,2000,96,Toby Keith,How Do You Like Me Now?!,6158.155,31,-,-,-,-,-
96,2000,97,Jessica Simpson,I Think I'm In Love With You,6113.016,21,15,-,-,-,10
97,2000,98,The Goo Goo Dolls,Broadway,6033.618,24,-,-,-,-,-
98,2000,99,Britney Spears,From The Bottom Of My Broken Heart,6020.75,14,-,-,-,-,37
99,2000,100,Lee Ann Womack,I Hope You Dance,5976.019,32,-,-,-,-,-


In [97]:
df_select_columns = pd.DataFrame(df_top_songs, columns=['year', 'position', 'artist', 'song'])
df_select_columns.head(5)

Unnamed: 0,year,position,artist,song
0,2000,1,Faith Hill,Breathe
1,2000,2,Joe Thomas,I Wanna Know
2,2000,3,Santana & The Product G,Maria Maria
3,2000,4,Vertical Horizon,Everything You Want
4,2000,5,Toni Braxton,He Wasn't Man Enough


- In the two cases above, we create a new `DataFrame` from the top_songs dataframe, once by selecting a few rows, and once by selecting only a few columns. Naturally, we can combine both selections.
- It is useful to understand that `DataFrame` treats the index (representing rows) and the columns in the same way. Both of them are of type `Index` in `pandas`. The following two cells show what I mean.

In [45]:
type(df_top_songs.index)

pandas.core.indexes.range.RangeIndex

In [46]:
type(df_top_songs.columns)

pandas.core.indexes.base.Index

This also means that a single row can be represented as a `Series`, whose index is made of the column names. So, everything we learned about `Series` applies to both rows and columns.
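As a minimal sketch of this, the example below rebuilds a small dataframe like `df_tuples` from earlier and selects one row by position with `iloc[]` (covered in more detail later in this lecture).

```python
import pandas as pd

tuples = [('Arya', 'Female'), ('Jon Snow', 'Male')]
df_people = pd.DataFrame(tuples, columns=['Name', 'Gender'])

# selecting a single row by position returns a Series...
first_row = df_people.iloc[0]
print(type(first_row))

# ...whose index is made of the DataFrame's column names
print(first_row.index.tolist())
print(first_row['Name'])
```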

**Exercise:** Extend the previous exercise (on creating 12 `Series` objects), to create a `DataFrame` that looks like a 12x12 multiplication table.

#### Column operations

In this section, we will learn how to select, add and modify columns in a dataframe.

As you have already seen, using the `[]` operator with a column name fetches that column.

In [47]:
df_top_songs['song'].head(5)

0                 Breathe
1            I Wanna Know
2             Maria Maria
3     Everything You Want
4    He Wasn't Man Enough
Name: song, dtype: object

You can add new columns like you add new elements in a dictionary, by assigning a new `Series` object, or any sequence.

In [48]:
df_top_songs['simple_score'] = (df_top_songs['score'] / max(df_top_songs['score'])).round(2)
df_top_songs.head(5)

Unnamed: 0,year,position,artist,song,score,us,uk,de,fr,ca,au,simple_score
0,2000,1,Faith Hill,Breathe,24030.051,2,33,-,-,-,23,0.86
1,2000,2,Joe Thomas,I Wanna Know,21516.777,4,-,-,61,-,34,0.77
2,2000,3,Santana & The Product G,Maria Maria,20941.78,1,6,-,1,-,49,0.75
3,2000,4,Vertical Horizon,Everything You Want,20402.965,1,42,-,-,-,24,0.73
4,2000,5,Toni Braxton,He Wasn't Man Enough,20068.614,2,5,-,14,-,5,0.72


In this case, we used the built-in `max()` function to find the maximum, the division operator to divide a `Series` by a scalar, and then the `round()` method of the `Series` type to round each number to two decimal places.
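`Series` also has its own `max()` method, which does the same job without looping over the values in Python. A minimal sketch on a small made-up table (the `score` values are copied from the first rows shown above, so the maximum here differs from the one in the full file):

```python
import pandas as pd

df_scores = pd.DataFrame({'score': [24030.051, 21516.777, 20941.78]})

# Series.max() returns the largest value in the column
highest = df_scores['score'].max()

# dividing a Series by a scalar divides every element
df_scores['simple_score'] = (df_scores['score'] / highest).round(2)
print(df_scores)
```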

You can delete a column using the `del` operator:

In [49]:
del df_top_songs['score']
df_top_songs.head(5)

Unnamed: 0,year,position,artist,song,us,uk,de,fr,ca,au,simple_score
0,2000,1,Faith Hill,Breathe,2,33,-,-,-,23,0.86
1,2000,2,Joe Thomas,I Wanna Know,4,-,-,61,-,34,0.77
2,2000,3,Santana & The Product G,Maria Maria,1,6,-,1,-,49,0.75
3,2000,4,Vertical Horizon,Everything You Want,1,42,-,-,-,24,0.73
4,2000,5,Toni Braxton,He Wasn't Man Enough,2,5,-,14,-,5,0.72
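Note that `del` changes the dataframe in place. If you would rather keep the original and get a new dataframe without the column, the `drop()` method does that. A minimal sketch on a small made-up dataframe:

```python
import pandas as pd

df_small = pd.DataFrame({'song': ['Breathe', 'I Wanna Know'],
                         'score': [24030.051, 21516.777]})

# drop() returns a new DataFrame; df_small itself is left untouched
df_without_score = df_small.drop(columns=['score'])

print(df_without_score.columns.tolist())
print(df_small.columns.tolist())
```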


#### Accessing data in a DataFrame

Here is the summary of how you can access data in a dataframe.

|Operation                                  |Syntax                         |Result                 |
|-------------------------------------------|-------------------------------|-----------------------|
|Select column                              |df['col_name']                 |Series                 |
|Select multiple columns                    |df[['col_name1', 'col_name2']] |DataFrame              |
|Select row by label                        |df.loc[label]                  |Series                 |
|Select row by integer location             |df.iloc[loc]                   |Series                 |
|Select column by integer location          |df.iloc[:, 0]                  |Series                 |
|Select row, column by integer location     |df.iloc[0, 0]                  |Scalar (a single value)|
|Select multiple columns by integer location|df.iloc[:, [0, 1]]             |DataFrame              |
|Slice rows                                 |df.iloc[5:10]                  |DataFrame              |
|Select rows by boolean vector (filtering)  |df[bool_vec]                   |DataFrame              |
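To make the table concrete, here is a minimal sketch that runs each kind of access on a tiny dataframe (the same data as `df_custom_index` from earlier). The variable names are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3, 5], 'two': [1, 2, 3, 4]},
                  index=['A', 'B', 'C', 'D'])

column = df['one']                 # Series
two_columns = df[['one', 'two']]   # DataFrame
row_by_label = df.loc['B']         # Series
row_by_position = df.iloc[1]       # Series (the same row as above)
first_column = df.iloc[:, 0]       # Series
single_value = df.iloc[0, 0]       # a single value
row_slice = df.iloc[1:3]           # DataFrame with rows 'B' and 'C'
filtered = df[df['one'] > 2]       # DataFrame with rows 'C' and 'D'

print(row_by_label.equals(row_by_position))
print(row_slice.index.tolist(), filtered.index.tolist())
```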

We have already checked out the first one. Let's check out the others too.

For these examples, let's reassign the index of the `DataFrame`.

In [79]:
df_top_songs.index = range(10000,10000 + len(df_top_songs))
df_top_songs.head(5)

Unnamed: 0,year,position,artist,song,us,uk,de,fr,ca,au,simple_score
10000,2000,1,Faith Hill,Breathe,2,33,-,-,-,23,0.86
10001,2000,2,Joe Thomas,I Wanna Know,4,-,-,61,-,34,0.77
10002,2000,3,Santana & The Product G,Maria Maria,1,6,-,1,-,49,0.75
10003,2000,4,Vertical Horizon,Everything You Want,1,42,-,-,-,24,0.73
10004,2000,5,Toni Braxton,He Wasn't Man Enough,2,5,-,14,-,5,0.72


In [80]:
# Accessing table entry for column artist and row with index label 10003
df_top_songs['artist'][10003]

'Vertical Horizon'

In [81]:
# Same as above
df_top_songs.loc[10003]['artist']

'Vertical Horizon'

In [82]:
# Same as above, again
df_top_songs.iloc[3]['artist']

'Vertical Horizon'

Note the difference between `loc[]` and `iloc[]` in the last two examples. Both are used with square brackets, not called like methods.
- `loc[]` uses the index label
- `iloc[]` uses the position of the row
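The label-based indexer also accepts a row and a column in one step, which avoids the two chained lookups used above. A minimal sketch with a small made-up dataframe whose index starts at 10000, like ours:

```python
import pandas as pd

df_small = pd.DataFrame(
    {'artist': ['Faith Hill', 'Joe Thomas', 'Santana & The Product G'],
     'song': ['Breathe', 'I Wanna Know', 'Maria Maria']},
    index=range(10000, 10003))

# loc[]: row label first, then column label
artist_by_label = df_small.loc[10001, 'artist']

# iloc[]: row position first, then column position
artist_by_position = df_small.iloc[1, 0]

print(artist_by_label, artist_by_position)
```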

In [83]:
# Same yet again, as artist Column is at position = 2
df_top_songs.iloc[3, 2]

'Vertical Horizon'

In [84]:
# We get a series representing the first 4 columns of the row at position 3
df_top_songs.iloc[3, :4]

year                       2000
position                      4
artist         Vertical Horizon
song        Everything You Want
Name: 10003, dtype: object

In [85]:
x = [1,2,3]
x[:]

[1, 2, 3]

In [86]:
# We get all the rows, but only the specified columns
df_top_songs.iloc[:, [0,2,3]].head(10)

Unnamed: 0,year,artist,song
10000,2000,Faith Hill,Breathe
10001,2000,Joe Thomas,I Wanna Know
10002,2000,Santana & The Product G,Maria Maria
10003,2000,Vertical Horizon,Everything You Want
10004,2000,Toni Braxton,He Wasn't Man Enough
10005,2000,Rob Thomas & Santana,Smooth
10006,2000,Aaliyah,Try Again
10007,2000,Matchbox Twenty,Bent
10008,2000,Lonestar,Amazed
10009,2000,Destiny's Child,Say My Name


In [87]:
df_some_rows = df_top_songs.iloc[10:15]
df_some_rows

Unnamed: 0,year,position,artist,song,us,uk,de,fr,ca,au,simple_score
10010,2000,11,Three Doors Down,Kryptonite,3,-,-,-,-,47,0.65
10011,2000,12,Madonna,Music,1,1,-,8,1,1,0.63
10012,2000,13,Destiny's Child,Jumpin' Jumpin',3,5,-,41,-,2,0.63
10013,2000,14,Sisqo,Thong Song,3,3,-,15,-,2,0.62
10014,2000,15,Creed,Higher,7,47,-,-,-,-,0.61


Note that `df_top_songs[10:15]` would produce the same result. When you use an integer slice with the `[]` operator, pandas interprets it as the _position_ of the rows instead of the index labels. This is consistent with how slicing works with `Series`. In contrast, `df_top_songs.loc[10:15]` slices by index label, so with our index starting at 10000 it would return an empty dataframe.

Still, I prefer to use `iloc[]` in this case, as it always uses the position of the rows.
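As a rough sketch of how the slicing forms differ, the example below uses a small made-up dataframe whose index starts at 10000 and compares a positional slice with label-based slices.

```python
import pandas as pd

df_small = pd.DataFrame({'position': range(1, 21)},
                        index=range(10000, 10020))

# iloc[] slices by position: rows at positions 10 to 14
by_position = df_small.iloc[10:15]

# loc[] slices by label: there are no labels between 10 and 15 here
by_label_missing = df_small.loc[10:15]

# loc[] with labels that exist; note that the end label is included
by_label = df_small.loc[10010:10014]

print(len(by_position), len(by_label_missing), len(by_label))
```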

In [88]:
df_some_rows_columns = df_top_songs.iloc[10:15, :4]
df_some_rows_columns

Unnamed: 0,year,position,artist,song
10010,2000,11,Three Doors Down,Kryptonite
10011,2000,12,Madonna,Music
10012,2000,13,Destiny's Child,Jumpin' Jumpin'
10013,2000,14,Sisqo,Thong Song
10014,2000,15,Creed,Higher


In [89]:
bool_series = (df_top_songs['simple_score'] > 0.5) & (df_top_songs['position'] < 10)
bool_series.head(10)

10000     True
10001     True
10002     True
10003     True
10004     True
10005     True
10006     True
10007     True
10008     True
10009    False
dtype: bool

In [90]:
df_filtered = df_top_songs[bool_series]
len(df_filtered)

76

- The filter works by giving a `Series` of booleans which is of the same length as the number of rows in the dataframe, in the expression `df[boolean_series]`.
- This will select only those rows for which the corresponding boolean in the `boolean_series` is True.


- The expressions used in filtering are achieved by overriding the operators. 
- You can use the expression of type `Series op Scalar` or `Series op Series`, where op can be `<`, `<=`, `==`, `!=`, `>`, `>=`.
- You can combine multiple such expressions using `&` (replacement for `and`) and `|` (replacement for `or`). Because of operator precedence, you need to put parentheses around the individual expressions when you combine them.
- You can use the `~` operator as a replacement for `not`.
- The result of these expressions is a `Series` of booleans.
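The rules above can be sketched on a small made-up dataframe (the values are invented for illustration):

```python
import pandas as pd

df_small = pd.DataFrame({'position': [1, 2, 3, 4],
                         'simple_score': [0.86, 0.40, 0.75, 0.30]})

high_score = df_small['simple_score'] > 0.5   # True for rows 0 and 2
top_two = df_small['position'] <= 2           # True for rows 0 and 1

# parentheses are needed when the comparisons are written inline
both = df_small[(df_small['simple_score'] > 0.5) & (df_small['position'] <= 2)]
either = df_small[high_score | top_two]
neither = df_small[~(high_score | top_two)]

print(len(both), len(either), len(neither))
```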

## Exercises

1. Create a `DataFrame` that has the columns Name, Age Group, Gender, Relationship to you for some people in your family and friend circle.


2. Create a `DataFrame` with 10 columns and 100 rows, where each entry in the dataframe is a random integer between 1 and 1000.
  - Find the mean value of each column. Check if all the 10 mean values are similar or rather different.
  - Repeat the process a few times and see if your observations change.


The following assignments will use the Google Play Store CSV file ([`googlestoredata.csv`](https://raw.githubusercontent.com/amangup/data-analysis-bootcamp/master/06-Pandas1/googlestoredata.csv))


3. Let's explore a few of the methods in the `Series` type in `pandas`. [This page](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html) has the list of methods for the `Series` class. For the following exercises, you will have to find the appropriate method and use it.
  - Find all the unique values in the `Content Rating` column.
  - Find the mean value of the `Rating` column.
  - Find all the apps whose rating is equal to the minimum rating in the table.
  - Find the app that has the most reviews.
  - Find the number of apps with `ART_AND_DESIGN` or `EDUCATION` category.
  - Find the most common category.


4. Find the weighted average of rating for all apps, where the weight in the calculation is the Install Count. Note that this value is higher than the mean value of the `Rating` column - what does this indicate?


5. We are going to find a selection of good apps using our own criteria: apps that have a good rating, and that are good enough that a lot of their users chose to review them.
  - Define a new column called `review_ratio` which calculates the ratio of number of reviews to number of installs.
  - Filter the dataframe with apps whose rating is > 4.5, and whose `review_ratio` is more than 5%.
  - How many apps are we left with?
  

6. What's the perfect app? Find all apps whose rating is equal to the maximum rating. Among these apps, find the app with the maximum number of reviews.


7. Find all the apps with the most recent `Last Updated` date. The following hints will be useful:
  - The `apply()` method in the `Series` class may be handy. Note that this method takes a function as an argument.
  - The date format to use with the `strptime()` method in the `datetime` module (if using) is `"%B %d, %Y"`.
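As a hint for how these two pieces fit together, here is a minimal sketch on a made-up `Series` of date strings. The sample dates and the helper name `parse_date` are assumptions for illustration, not taken from the CSV file.

```python
from datetime import datetime

import pandas as pd

# made-up date strings in the "Month day, Year" format
date_strings = pd.Series(['January 7, 2018', 'August 1, 2018'])

def parse_date(text):
    # %B = full month name, %d = day of month, %Y = four-digit year
    return datetime.strptime(text, "%B %d, %Y")

# apply() calls parse_date once for every element
parsed = date_strings.apply(parse_date)
print(parsed)
```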
 