## Topics

**Series**

**DataFrame**

**Index Objects, Reindexing**

**Data aggregation - group by specific row values**

**Dropping Entries from an Axis**

**Indexing, Selection and Filtering**

**loc and iloc**

**Descriptive Statistics**


## Series

https://www.geeksforgeeks.org/python-pandas-series/

- Creating a series
- Accessing an element in a series
- Binary operations (add, sub, mul, etc.)


### Creating a Series

A Series can be created from a list (or NumPy array) or from a dictionary.

Creating an empty Series

In [None]:
import pandas as pd
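# pass dtype explicitly: the default dtype of an empty Series differs between pandas versions
pd.Series(dtype='float64')

Series([], dtype: float64)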

Creating a Series from an array. By default, the index runs from 0 to len(array)-1.

In [None]:
import numpy as np
array = np.array(['a','b','c','d'])
pd.Series(array)

0    a
1    b
2    c
3    d
dtype: object

To use labels other than the default ones, we can pass them with the index argument.

In [None]:
array = np.array(['a','b','c','d'])
pd.Series(array,index=[1,2,3,4])

1    a
2    b
3    c
4    d
dtype: object

Creating a Series from a dictionary.

In [None]:
# aDict stores how many fruits we have.
aDict = {'Apple':3, 'Banana':5, 'Cherry': 2, 'Peach': 10}
pd.Series(aDict)

Apple      3
Banana     5
Cherry     2
Peach     10
dtype: int64

### Sorting a Series

To sort a Series by its values, we can use the sort_values method. By default, it sorts the elements in ascending order.

In [None]:
s = pd.Series(aDict)
s.sort_values()

Cherry     2
Apple      3
Banana     5
Peach     10
dtype: int64

Let's sort the Series in descending order.

In [None]:
s.sort_values(ascending=False)

Peach     10
Banana     5
Apple      3
Cherry     2
dtype: int64

### Accessing an element in a Series

We can access elements in a Series in the following ways:
- by position (index number)
- by index label

In [None]:
array = np.array(['a','b','c','d','e','f','g','h'])
s = pd.Series(array)

We can access the first element by index number 0.

In [None]:
s[0]

'a'

In [None]:
s[5]

'f'

We can access the first 3 elements with the slice [:3].

In [None]:
s[:3]

0    a
1    b
2    c
dtype: object

We can retrieve a single element using its index label.

In [None]:
aDict = {'Apple':3, 'Banana':5, 'Cherry': 2, 'Peach': 10}
s = pd.Series(aDict)
s['Apple']

3

In [None]:
s['Peach']

10

In [None]:
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
s['c']

3
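
The two access styles above can also be written explicitly with iloc (by position) and loc (by label), which removes any ambiguity when the index labels are themselves integers. A minimal sketch, reusing the small Series from the last cell:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

first = s.iloc[0]             # by position: the first element
by_label = s.loc['c']         # by index label
head = s.iloc[:3]             # positions 0, 1, 2 (end position excluded)
label_slice = s.loc['a':'c']  # label slices include both endpoints

print(first, by_label)
print(head.tolist(), label_slice.tolist())
```

Note that the positional slice stops before position 3, while the label slice includes 'c'.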

### Binary Operations of Series
- Add
- Sub
- Mul

Let's learn how we can add two series using binary operations.

In [None]:
# create the first Series
s1 = pd.Series([1,5,6,2])

# create the second Series
s2 = pd.Series([4,1,3,5])

# equivalent to: answer = s1 + s2
answer = s1.add(s2)
print(answer)

0    5
1    6
2    9
3    7
dtype: int64


We can also subtract one Series from another using the sub() method.

In [None]:
# equivalent to: answer = s1 - s2
answer = s1.sub(s2)
print(answer)

0   -3
1    4
2    3
3   -3
dtype: int64


Similarly, the mul() method multiplies two Series element by element.

In [None]:
answer = s1.mul(s2)
print(answer)

0     4
1     5
2    18
3    10
dtype: int64


More simply, we can use the arithmetic operators +, - and *, which give the same results.

In [None]:
s1 + s2

0    5
1    6
2    9
3    7
dtype: int64

In [None]:
s1 - s2

0   -3
1    4
2    3
3   -3
dtype: int64

In [None]:
s1 * s2

0     4
1     5
2    18
3    10
dtype: int64
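
One reason to keep the method form in mind: binary operations align the two Series on their index labels, and a label that appears in only one Series produces NaN. The add, sub and mul methods accept a fill_value argument for that case. A small sketch with made-up values and string labels chosen for illustration:

```python
import pandas as pd

s1 = pd.Series([1, 5, 6], index=['a', 'b', 'c'])
s2 = pd.Series([4, 1, 3], index=['b', 'c', 'd'])

plain = s1 + s2                    # 'a' and 'd' have no partner, so they become NaN
filled = s1.add(s2, fill_value=0)  # a missing partner is treated as 0

print(plain)
print(filled)
```

Both results are float64, because NaN cannot be stored in an integer Series.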

### Index Object

### Reindexing
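
A minimal sketch of reindexing, using made-up values: reindex returns a new Series arranged according to the requested labels. Labels that were not in the original index get NaN, or the fill_value if one is passed.

```python
import pandas as pd

s = pd.Series([3, 5, 2], index=['Apple', 'Banana', 'Cherry'])

# 'Peach' is not in the original index, so it becomes NaN
reordered = s.reindex(['Cherry', 'Apple', 'Peach'])

# with fill_value, new labels get 0 instead of NaN
filled = s.reindex(['Cherry', 'Apple', 'Peach'], fill_value=0)

print(s.index)
print(reordered)
print(filled)
```

The original Series s is left unchanged; 'Banana' is simply absent from the reindexed results.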

### DataFrame

Creating a DataFrame from a dictionary of lists.

In [None]:
data = {'First Name': ['Jennifer', 'Chris', 'John', 'Annie', 'Chloe'], 
        'Last Name': ['Brown', 'Smith', 'Williams', 'Wong', 'Anderson'], 
        'Age': [16, 32, 21, 35, 27],
        'Hourly Wage': [8,14,60,44,80],
        'Hours per week': [20,28,40,40,32]
        } 

df = pd.DataFrame(data)

Creating new columns based on existing columns

In [None]:
df['Full Name'] = df['First Name'] + ' ' + df['Last Name']
df['Weekly Salary'] = df['Hourly Wage'] * df['Hours per week']

print(df)

  First Name Last Name  Age  ...  Hours per week       Full Name Weekly Salary
0   Jennifer     Brown   16  ...              20  Jennifer Brown           160
1      Chris     Smith   32  ...              28     Chris Smith           392
2       John  Williams   21  ...              40   John Williams          2400
3      Annie      Wong   35  ...              40      Annie Wong          1760
4      Chloe  Anderson   27  ...              32  Chloe Anderson          2560

[5 rows x 7 columns]


Creating a new column based on a condition

In [None]:
# a vectorized comparison yields a boolean column directly
df['High Income'] = df['Weekly Salary'] > 2000
df

| | First Name | Last Name | Age | Hourly Wage | Hours per week | Full Name | Weekly Salary | High Income |
|---|---|---|---|---|---|---|---|---|
| 0 | Jennifer | Brown | 16 | 8 | 20 | Jennifer Brown | 160 | False |
| 1 | Chris | Smith | 32 | 14 | 28 | Chris Smith | 392 | False |
| 2 | John | Williams | 21 | 60 | 40 | John Williams | 2400 | True |
| 3 | Annie | Wong | 35 | 44 | 40 | Annie Wong | 1760 | False |
| 4 | Chloe | Anderson | 27 | 80 | 32 | Chloe Anderson | 2560 | True |


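Selecting rows that meet a criterion (here, rows that are not high income)

In [None]:
# ~ negates the boolean column, keeping rows where High Income is False
df[~df['High Income']]

| | First Name | Last Name | Age | Hourly Wage | Hours per week | Full Name | Weekly Salary | High Income |
|---|---|---|---|---|---|---|---|---|
| 0 | Jennifer | Brown | 16 | 8 | 20 | Jennifer Brown | 160 | False |
| 1 | Chris | Smith | 32 | 14 | 28 | Chris Smith | 392 | False |
| 3 | Annie | Wong | 35 | 44 | 40 | Annie Wong | 1760 | False |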
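
Criteria can also be combined. A minimal sketch, using a cut-down copy of the data above (the thresholds 25 and 1000 are arbitrary choices for illustration): each condition goes in parentheses, conditions are joined with & (and) or | (or), and loc applies the mask and picks columns in one step.

```python
import pandas as pd

df = pd.DataFrame({
    'First Name': ['Jennifer', 'Chris', 'John', 'Annie', 'Chloe'],
    'Age': [16, 32, 21, 35, 27],
    'Weekly Salary': [160, 392, 2400, 1760, 2560],
})

# rows where the person is over 25 AND earns more than 1000 per week
mask = (df['Age'] > 25) & (df['Weekly Salary'] > 1000)
selected = df.loc[mask, ['First Name', 'Weekly Salary']]

print(selected)
```

The parentheses are required because & binds more tightly than the comparison operators.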


Dropping rows that do not meet the criterion (drop returns a new DataFrame and leaves df unchanged)

In [None]:
df.drop(df[df['High Income'] == True].index)

| | First Name | Last Name | Age | Hourly Wage | Hours per week | Full Name | Weekly Salary | High Income |
|---|---|---|---|---|---|---|---|---|
| 0 | Jennifer | Brown | 16 | 8 | 20 | Jennifer Brown | 160 | False |
| 1 | Chris | Smith | 32 | 14 | 28 | Chris Smith | 392 | False |
| 3 | Annie | Wong | 35 | 44 | 40 | Annie Wong | 1760 | False |


To remove the rows from df itself, pass inplace=True

In [None]:
df.drop(df[df['High Income'] == True].index, inplace=True)
df

| | First Name | Last Name | Age | Hourly Wage | Hours per week | Full Name | Weekly Salary | High Income |
|---|---|---|---|---|---|---|---|---|
| 0 | Jennifer | Brown | 16 | 8 | 20 | Jennifer Brown | 160 | False |
| 1 | Chris | Smith | 32 | 14 | 28 | Chris Smith | 392 | False |
| 3 | Annie | Wong | 35 | 44 | 40 | Annie Wong | 1760 | False |
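
The same result can be reached without drop. A sketch on a cut-down copy of the data, assuming High Income is a boolean column: keeping the rows where the flag is False is equivalent to dropping the rows where it is True, and assigning the result to a name avoids the need for inplace=True.

```python
import pandas as pd

df = pd.DataFrame({
    'First Name': ['Jennifer', 'Chris', 'John', 'Annie', 'Chloe'],
    'Weekly Salary': [160, 392, 2400, 1760, 2560],
})
df['High Income'] = df['Weekly Salary'] > 2000

dropped = df.drop(df[df['High Income']].index)  # drop returns a new DataFrame
kept = df[~df['High Income']]                   # keep rows where the flag is False

print(kept)
print(len(df))  # df itself still has all 5 rows
```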


## Tutorials using "" dataset

https://www.kaggle.com/divyansh22/flight-delay-prediction



https://www.kaggle.com/abecklas/fifa-world-cup

https://www.kaggle.com/martj42/international-football-results-from-1872-to-2017

https://www.kaggle.com/timoboz/superbowl-history-1967-2020



https://www.kaggle.com/unsdsn/world-happiness


Flight delay data
- Which airlines experience the most delays? (# delays by airline)
- Which day of the week experiences the most delays? (# delays by day of the week)
- Which origins experience the most delays? (# delays by origin and destination)
- What percentage of flights arrived within 15 mins of schedule even though they departed more than 15 mins late?
- latitude, longitude
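
Questions like "# delays by airline" map onto a group-by aggregation. A minimal sketch on made-up rows; the column names airline, day_of_week, dep_delayed and arr_delayed are hypothetical placeholders, not the actual columns of the Kaggle dataset:

```python
import pandas as pd

# made-up rows; column names are hypothetical, not taken from the Kaggle file
flights = pd.DataFrame({
    'airline':     ['AA', 'AA', 'DL', 'DL', 'DL', 'UA'],
    'day_of_week': [1, 2, 1, 1, 3, 2],
    'dep_delayed': [1, 0, 1, 1, 0, 1],  # 1 = departed more than 15 mins late
    'arr_delayed': [1, 0, 0, 1, 0, 0],  # 1 = arrived more than 15 mins late
})

# number of delayed departures per airline, most delayed first
delays_by_airline = (flights.groupby('airline')['dep_delayed']
                            .sum()
                            .sort_values(ascending=False))

# percentage of late departures that still arrived on time
late = flights[flights['dep_delayed'] == 1]
recovered_pct = (late['arr_delayed'] == 0).mean() * 100

print(delays_by_airline)
print(recovered_pct)
```

Grouping by day_of_week instead of airline would answer the day-of-week question in the same way.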

Cancellations data
- 

FIFA World Cup games
- Compare the score of each game between the home team and the away team
- Compare the performance of each team when playing as the home team vs. the away team
- Get the performance stats for each country
- Pick five countries and see their performance over time (ex: England performed better in the past, etc.)

I can find another relevant dataset and join the two datasets (ex: flight delay data joined with a dataset of airline names).
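
Joining two datasets is done with merge. A minimal sketch with made-up data; the carrier codes, airline names and column names here are hypothetical placeholders:

```python
import pandas as pd

# made-up data; columns and values are placeholders for illustration
delays = pd.DataFrame({'carrier': ['AA', 'DL', 'ZZ'], 'n_delays': [10, 7, 3]})
airlines = pd.DataFrame({'carrier': ['AA', 'DL'],
                         'airline_name': ['Airline A', 'Airline D']})

# a left join keeps every row of delays; carriers without a match get NaN
joined = delays.merge(airlines, on='carrier', how='left')

print(joined)
```

A left join is used here so that rows with an unknown carrier code are kept rather than silently dropped.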