# Tutorial Series — Data Analysis with Python - Full Course for Beginners (Numpy, Pandas, Matplotlib, Seaborn)
Source: https://www.youtube.com/watch?v=r-uOLxNrNk8

## About Pandas (practical starts at 1:57:14)
- Pandas is one of the most essential parts of data analysis because it is used to format, group and sort data extracted from numerous sources. It is also used to generate reports and for machine learning.
- There are 2 core Pandas concepts that you should know:
- `Series` and `DataFrame`


In [1]:
import pandas as pd
import numpy as np # we will use numpy for some of our pandas examples

## Pandas Series

### Series vs List

**Defining A Normal List**

In [2]:
my_list = [1,2,3,4,5,6,7]
my_list

[1, 2, 3, 4, 5, 6, 7]

**Defining A Pandas Series**

In [3]:
my_series = pd.Series([1, 2, 3, 4, 5, 6, 7])
my_series

0    1
1    2
2    3
3    4
4    5
5    6
6    7
dtype: int64

**Similarity 1.** Both a Pandas Series and a Python list have indexes.
- BUT, for a Pandas Series, you can change the index labels, like the keys of a dictionary (see **Difference 3**)

In [4]:
my_list[0]

1

In [5]:
my_list[0] = 100
my_list

[100, 2, 3, 4, 5, 6, 7]

In [6]:
my_series[0]

1

In [7]:
my_series[0] = 100
my_series

0    100
1      2
2      3
3      4
4      5
5      6
6      7
dtype: int64

**Difference 1.** A Pandas Series has a datatype because it is backed by a NumPy array.

In [5]:
# Series.dtype — return the dtype object of the underlying data.
my_series.dtype

dtype('int64')

In [9]:
my_numpy = np.array([1,2,3,4,5,6,7])
my_numpy

array([1, 2, 3, 4, 5, 6, 7])

In [10]:
my_series.values

array([100,   2,   3,   4,   5,   6,   7], dtype=int64)

In [11]:
type(my_series.values)

numpy.ndarray
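
One practical consequence of the NumPy backing, as a minimal sketch with made-up values (the names `numbers`, `doubled` and `mixed` are illustrative and not from the tutorial): arithmetic is applied to every element at once, and a Series that mixes Python types falls back to the generic `object` dtype.

```python
import pandas as pd

# Illustrative values, not taken from the tutorial.
numbers = pd.Series([1, 2, 3])

# Arithmetic is applied element-wise, as with a NumPy array.
doubled = numbers * 2
print(doubled.tolist())

# Mixing types forces the generic object dtype.
mixed = pd.Series([1, "two", 3.0])
print(mixed.dtype)
```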

**Difference 2.** A Pandas Series has a name
- The name will make much more sense when you see it being used in a Pandas dataframe

In [12]:
my_series.name = "The 7 numbers"
my_series

0    100
1      2
2      3
3      4
4      5
5      6
6      7
Name: The 7 numbers, dtype: int64
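
As a rough preview of why the name matters, here is a small sketch (the Series `apples` and `bananas` and their values are hypothetical): when named Series are placed side by side with `pd.concat`, each name becomes a column label.

```python
import pandas as pd

# Hypothetical Series, used only to illustrate the role of the name.
apples = pd.Series([3, 5], name="Apples")
bananas = pd.Series([2, 4], name="Bananas")

# Placing the Series side by side: each name becomes a column label.
fruit = pd.concat([apples, bananas], axis=1)
print(list(fruit.columns))
```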

**Difference 3.** You can change the index of a Pandas Series
- Now we can access each element by name instead of by position
- You may also notice that it now looks more like an ordered dictionary than a list.
- After you change the index to e.g. strings, you can still access elements in order using `my_series.iloc[0]` etc. (see below)

Update: 
- The cell below assigns the labels with curly braces `{...}`, which creates a Python set. A set has no defined order, so the labels end up attached to the values in an arbitrary order.
- For example, `Num of Apple` is not the label of the first value in the output here.
- The fix is to pass the labels as a list, either to `my_series.index` or to `pandas.Series.set_axis` (see below)

In [6]:
my_series.index = {
    "Num of Apple",
    "Num of Banana",
    "Num of Candy",
    "Num of Donut",
    "Num of Eclair",
    "Num of Frosty",
    "Num of Gracker",
}
my_series

Num of Donut      1
Num of Eclair     2
Num of Gracker    3
Num of Frosty     4
Num of Candy      5
Num of Apple      6
Num of Banana     7
dtype: int64

A Stack Overflow answer suggests using `pandas.Series.set_axis`. It is called with a list of labels here, so the order is preserved.
See: https://pandas.pydata.org/docs/reference/api/pandas.Series.set_axis.html

In [14]:
my_series = my_series.set_axis(['Num of Apple', 'Num of Banana', 'Num of Candy', 'Num of Donut', 'Num of Eclair', 'Num of Frosty', 'Num of Gracker'], axis=0)

my_series

Num of Apple      100
Num of Banana       2
Num of Candy        3
Num of Donut        4
Num of Eclair       5
Num of Frosty       6
Num of Gracker      7
Name: The 7 numbers, dtype: int64
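
The scrambled order seen earlier comes from the curly braces, which build an unordered Python set. A minimal sketch of the two list-based alternatives follows; the shortened labels and the variable names are illustrative only.

```python
import pandas as pd

labels = ["Apple", "Banana", "Candy"]  # a list keeps its order

# Option 1: assign a list to .index (modifies the Series in place).
in_place = pd.Series([1, 2, 3])
in_place.index = labels

# Option 2: set_axis returns a new Series and leaves the original untouched.
original = pd.Series([1, 2, 3])
relabelled = original.set_axis(labels)

print(list(in_place.index))
print(list(relabelled.index))
print(list(original.index))
```

Either way, the first label lines up with the first value because a list preserves insertion order.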

**Accessing A Value by Index**

In [15]:
my_series["Num of Candy"]

3

**Accessing A Value by Numerical Index (even after you changed the index)**

In [7]:
# iloc — integer location — i.e. locate by sequential position
my_series.iloc[2] # get the 3rd element, i.e. Candy in this case, which has the value 3

3

In [8]:
# iloc — integer location — i.e. locate by sequential position
my_series.iloc[-1] # get the last element

7
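
To make the intent explicit, the two lookups can be written with `loc` (by label) and `iloc` (by position). This is a small sketch on a hypothetical Series named `stock`, not a cell from the tutorial.

```python
import pandas as pd

# Hypothetical Series with string labels.
stock = pd.Series([10, 20, 30], index=["Apple", "Banana", "Candy"])

by_label = stock.loc["Banana"]   # look up by index label
by_position = stock.iloc[1]      # look up by 0-based position

print(by_label, by_position)
```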

**Difference 4.** A Pandas Series supports selecting with multiple indices (just like NumPy), unlike a list
- In other words, I can extract specific parts of a series into another series.

In [18]:
donut_and_banana = my_series[["Num of Donut", "Num of Banana"]]
donut_and_banana

Num of Donut     4
Num of Banana    2
Name: The 7 numbers, dtype: int64

**It returns a Series**

In [19]:
type(donut_and_banana)

pandas.core.series.Series

**Difference 5.** Pandas Series slicing by label includes the upper limit

In [20]:
my_list[0:2] # get element 0 and 1 (upper limit which is 2 is not included)

[100, 2]

**But this does not apply to positional (numerical) slicing, only to slicing with your own index labels**

In [21]:
my_series[0:2]

Num of Apple     100
Num of Banana      2
Name: The 7 numbers, dtype: int64

In [22]:
my_series["Num of Donut": "Num of Frosty"] # get Donut, Eclair and Frosty (the upper limit is included)

Num of Donut     4
Num of Eclair    5
Num of Frosty    6
Name: The 7 numbers, dtype: int64
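
The same contrast can be spelled out with `loc` and `iloc`. A minimal sketch on a hypothetical Series called `letters`: a label slice keeps its end label, while a positional slice stops before its end position.

```python
import pandas as pd

# Hypothetical Series with single-letter labels.
letters = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

by_label = letters.loc["a":"c"]   # label slice: "c" is included
by_position = letters.iloc[0:2]   # positional slice: position 2 is excluded

print(list(by_label.index))
print(list(by_position.index))
```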

**Defining A Pandas Series (with an index and a name)**

In [23]:
my_series_2 = pd.Series({
    "Num of Apple": 2,
    "Num of Banana": 4,
    "Num of Candy": 6,
    "Num of Donut": 8,
    "Num of Eclair": 10,
    "Num of Frosty": 12,
    "Num of Gracker": 14
}, name = "My 7 Numbers")

my_series_2

Num of Apple       2
Num of Banana      4
Num of Candy       6
Num of Donut       8
Num of Eclair     10
Num of Frosty     12
Num of Gracker    14
Name: My 7 Numbers, dtype: int64

In [24]:
my_series_3 = pd.Series(
    [2, 4, 6, 8, 10, 12, 14],
    index=["Apple", "Banana", "Candy", "Donut", "Eclair", "Frosty", "Gracker"],
    name = "The 7 Numbers"
)

my_series_3

Apple       2
Banana      4
Candy       6
Donut       8
Eclair     10
Frosty     12
Gracker    14
Name: The 7 Numbers, dtype: int64
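
The two constructor styles above should be interchangeable. A short sketch with two made-up entries, checking that a dictionary and a list plus `index` produce the same data (the variable names are illustrative):

```python
import pandas as pd

# Dictionary keys become the index labels.
from_dict = pd.Series({"Apple": 2, "Banana": 4})

# The same data as a list of values plus an explicit index.
from_list = pd.Series([2, 4], index=["Apple", "Banana"])

# equals() compares both the values and the index labels.
print(from_dict.equals(from_list))
```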

### A Summary of Pandas Series

It's an ordered sequence of elements, backed by a NumPy array (and thus fast and efficient). 
It also has an index that can take any labels we pass, which makes label-based lookups convenient.

P.S. I stopped the Pandas Series part of the tutorial at 2:09:10 because the rest isn't very useful to me at the moment.

## Pandas Dataframe (starts at 2:15:00)

![image.png](attachment:32b5e0e9-986c-465e-a1da-4ee1072b02c0.png)

Creating `DataFrame`s manually can be tedious. 99% of the time you'll be pulling the data from a database, a CSV file or the web. But still, you can create a DataFrame by specifying the columns and values:


In [25]:
df = pd.DataFrame({
    'Population': [35.467, 63.951, 80.94 , 60.665, 127.061, 64.511, 318.523],
    'GDP': [
        1785387,
        2833687,
        3874437,
        2167744,
        4602367,
        2950039,
        17348075
    ],
    'Surface Area': [
        9984670,
        640679,
        357114,
        301336,
        377930,
        242495,
        9525067
    ],
    'HDI': [
        0.913,
        0.888,
        0.916,
        0.873,
        0.891,
        0.907,
        0.915
    ],
    'Continent': [
        'America',
        'Europe',
        'Europe',
        'Europe',
        'Asia',
        'Europe',
        'America'
    ]
}, columns=['Population', 'GDP', 'Surface Area', 'HDI', 'Continent'])
# (The columns argument is optional. I'm using it to keep the same order as in the picture above)

**A dataframe is very similar to an excel table**

- **Question:** How does dataframe relate to series?
- **Answer:** Each column of dataframe is a series. This means that there are a total of FIVE Pandas series below

In [26]:
df

|  | Population | GDP | Surface Area | HDI | Continent |
|---|---|---|---|---|---|
| 0 | 35.467 | 1785387 | 9984670 | 0.913 | America |
| 1 | 63.951 | 2833687 | 640679 | 0.888 | Europe |
| 2 | 80.94 | 3874437 | 357114 | 0.916 | Europe |
| 3 | 60.665 | 2167744 | 301336 | 0.873 | Europe |
| 4 | 127.061 | 4602367 | 377930 | 0.891 | Asia |
| 5 | 64.511 | 2950039 | 242495 | 0.907 | Europe |
| 6 | 318.523 | 17348075 | 9525067 | 0.915 | America |
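
The column-is-a-Series relationship can be checked directly. A minimal sketch using a two-row stand-in frame (`small_df` is a hypothetical name; the values are borrowed from the first two rows of the table):

```python
import pandas as pd

# Small stand-in frame with two of the columns.
small_df = pd.DataFrame({
    "Population": [35.467, 63.951],
    "Continent": ["America", "Europe"],
})

column = small_df["Population"]
print(type(column).__name__)  # class name of the column object
print(column.name)            # the Series name is the column label
```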


**Assigning index to dataframe (just like how you did it with series)**

In [27]:
df.index = [
    'Canada',
    'France',
    'Germany',
    'Italy',
    'Japan',
    'United Kingdom',
    'United States',
]
df

|  | Population | GDP | Surface Area | HDI | Continent |
|---|---|---|---|---|---|
| Canada | 35.467 | 1785387 | 9984670 | 0.913 | America |
| France | 63.951 | 2833687 | 640679 | 0.888 | Europe |
| Germany | 80.94 | 3874437 | 357114 | 0.916 | Europe |
| Italy | 60.665 | 2167744 | 301336 | 0.873 | Europe |
| Japan | 127.061 | 4602367 | 377930 | 0.891 | Asia |
| United Kingdom | 64.511 | 2950039 | 242495 | 0.907 | Europe |
| United States | 318.523 | 17348075 | 9525067 | 0.915 | America |
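
As an alternative to assigning `df.index` afterwards, the labels can also be passed when the DataFrame is created. A small sketch with two countries only (the name `countries` is illustrative):

```python
import pandas as pd

# Row labels are supplied up front through the index argument.
countries = pd.DataFrame(
    {"Population": [35.467, 63.951], "HDI": [0.913, 0.888]},
    index=["Canada", "France"],
)

print(list(countries.index))
```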


**Accessing Pandas Dataframe columns**

In [28]:
df.columns

Index(['Population', 'GDP', 'Surface Area', 'HDI', 'Continent'], dtype='object')

**Accessing Pandas Dataframe indexes**

In [29]:
df.index

Index(['Canada', 'France', 'Germany', 'Italy', 'Japan', 'United Kingdom',
       'United States'],
      dtype='object')

### Useful attributes and methods of Pandas Dataframe

In [30]:
# Print a concise summary of a DataFrame.
# This method prints information about a DataFrame including the index dtype and columns, non-null values and memory usage.
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7 entries, Canada to United States
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Population    7 non-null      float64
 1   GDP           7 non-null      int64  
 2   Surface Area  7 non-null      int64  
 3   HDI           7 non-null      float64
 4   Continent     7 non-null      object 
dtypes: float64(2), int64(2), object(1)
memory usage: 336.0+ bytes


In [31]:
# Return an int representing the number of elements in this object.
# Return the number of rows if Series. Otherwise return the number of rows times number of columns if DataFrame.
df.size

35

In [32]:
# Return a tuple representing the dimensionality of the DataFrame.
df.shape # i.e. 7 rows x 5 columns

(7, 5)
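
`size` and `shape` are related: `size` is the number of rows times the number of columns. A quick sketch on a hypothetical 3 x 2 frame named `grid`:

```python
import pandas as pd

# Hypothetical frame with 3 rows and 2 columns.
grid = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

rows, cols = grid.shape
print(rows, cols)
print(grid.size == rows * cols)
print(len(grid))  # len() gives the number of rows
```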

In [33]:
# Generate descriptive statistics.
# Descriptive statistics include those that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.
df.describe()

|  | Population | GDP | Surface Area | HDI |
|---|---|---|---|---|
| count | 7.0 | 7.0 | 7.0 | 7.0 |
| mean | 107.302571 | 5080248.0 | 3061327.0 | 0.900429 |
| std | 97.24997 | 5494020.0 | 4576187.0 | 0.016592 |
| min | 35.467 | 1785387.0 | 242495.0 | 0.873 |
| 25% | 62.308 | 2500716.0 | 329225.0 | 0.8895 |
| 50% | 64.511 | 2950039.0 | 377930.0 | 0.907 |
| 75% | 104.0005 | 4238402.0 | 5082873.0 | 0.914 |
| max | 318.523 | 17348080.0 | 9984670.0 | 0.916 |
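
Note that `Continent` is missing from the output: by default `describe()` summarizes only numeric columns. A hedged sketch on a small stand-in frame (`sample` is a hypothetical name) showing that `include="all"` brings the text column back:

```python
import pandas as pd

# Stand-in frame with one numeric and one text column.
sample = pd.DataFrame({
    "HDI": [0.913, 0.888, 0.916],
    "Continent": ["America", "Europe", "Europe"],
})

numeric_only = sample.describe()             # default: numeric columns only
everything = sample.describe(include="all")  # also summarizes text columns

print(list(numeric_only.columns))
print(list(everything.columns))
```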


In [34]:
# Return the dtypes in the DataFrame.
# This returns a Series with the data type of each column. The result’s index is the original DataFrame’s columns. Columns with mixed types are stored with the object dtype.
df.dtypes

Population      float64
GDP               int64
Surface Area      int64
HDI             float64
Continent        object
dtype: object

In [35]:
# count how many columns there are of each datatype
df.dtypes.value_counts()

float64    2
int64      2
object     1
dtype: int64
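
Once the dtypes are known, they can be used to pick columns. A minimal sketch using `select_dtypes` on a stand-in frame (`sample` and its values are illustrative):

```python
import pandas as pd

# Stand-in frame with an integer, a float and a text column.
sample = pd.DataFrame({
    "GDP": [1785387, 2833687],
    "HDI": [0.913, 0.888],
    "Continent": ["America", "Europe"],
})

# Keep only the numeric columns (integers and floats).
numeric = sample.select_dtypes(include="number")
print(list(numeric.columns))
```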

### Accessing Values of Dataframe

In [36]:
df

|  | Population | GDP | Surface Area | HDI | Continent |
|---|---|---|---|---|---|
| Canada | 35.467 | 1785387 | 9984670 | 0.913 | America |
| France | 63.951 | 2833687 | 640679 | 0.888 | Europe |
| Germany | 80.94 | 3874437 | 357114 | 0.916 | Europe |
| Italy | 60.665 | 2167744 | 301336 | 0.873 | Europe |
| Japan | 127.061 | 4602367 | 377930 | 0.891 | Asia |
| United Kingdom | 64.511 | 2950039 | 242495 | 0.907 | Europe |
| United States | 318.523 | 17348075 | 9525067 | 0.915 | America |


**Use `loc` to get <u>a single row</u> by index label**

In [40]:
# get row data for canada
df.loc['Canada'] # return type is a Pandas series

Population       35.467
GDP             1785387
Surface Area    9984670
HDI               0.913
Continent       America
Name: Canada, dtype: object

**Use `iloc` to get <u>a single row</u> by sequential position**

In [42]:
# get row data for germany (which is at position 2)
df.iloc[2] # return type is a Pandas series

Population        80.94
GDP             3874437
Surface Area     357114
HDI               0.916
Continent        Europe
Name: Germany, dtype: object

**Alternatively, you can get <u>a single column</u> by doing `df['Column Name']`**

In [46]:
df['Population'] # return type is a Pandas series

Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: Population, dtype: float64
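
The row and column lookups can also be combined. A short sketch on a three-country stand-in frame (the name `countries` is illustrative), assuming the usual `loc[row, column]` and `iloc[row, column]` forms:

```python
import pandas as pd

# Stand-in frame indexed by country name.
countries = pd.DataFrame(
    {"Population": [35.467, 63.951, 80.94], "HDI": [0.913, 0.888, 0.916]},
    index=["Canada", "France", "Germany"],
)

single_value = countries.loc["France", "HDI"]  # row label, column label
same_value = countries.iloc[1, 1]              # row position, column position
subset = countries[["Population", "HDI"]]      # a list of columns gives a DataFrame
first_two = countries.loc["Canada":"France"]   # label slice, end label included

print(single_value, same_value)
print(type(subset).__name__, len(first_two))
```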

## End of Pandas Tutorial (I stopped around 2:23:33 because I don't need the rest of the pandas practical)