## Series
- A Series is used to model one-dimensional data, similar to a list in Python.
- Because a Series is one-dimensional, it has a single axis: the index. Below is a table of counts of 
songs artists composed:
![series_example.PNG](attachment:series_example.PNG)

In [1]:
import pandas as pd
ser = pd.Series([145, 142, 38, 13], name = 'counts')
ser

0    145
1    142
2     38
3     13
Name: counts, dtype: int64

## Loading your first Data Set
- The simplest way to look at a data set is to examine and subset specific rows and columns. We
can see which type of information is stored in each column, and can start looking for patterns by 
computing aggregate descriptive statistics.

In [1]:
import pandas as pd

url = r"C:\Users\ajayi\OneDrive\Desktop\Training_Recordings\gapminder.csv"
gap_df = pd.read_csv(url)


## Explore the data
- Check the first few rows of the data
- Check the dimensions of the data
- Check for any missing values
- Perform some simple statistics on the data
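
The checklist above can be sketched on a small made-up frame. This is only an illustration: `toy_df` and its values are hypothetical stand-ins, not the gapminder data, and the same method calls would apply to `gap_df`.

```python
import pandas as pd

# Hypothetical toy data standing in for a real data set
toy_df = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [2002, 2007, 2002, 2007],
    "lifeExp": [70.1, float("nan"), 65.3, 66.0],
})

first_rows = toy_df.head(2)        # first few rows
n_rows, n_cols = toy_df.shape      # dimensions as (rows, columns)
missing = toy_df.isna().sum()      # number of missing values per column
summary = toy_df.describe()        # simple statistics for the numeric columns
col_names = list(toy_df.columns)   # column names
col_types = toy_df.dtypes          # data type of each column
toy_df.info()                      # prints columns, non-null counts, and dtypes
```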

In [32]:
# check the first few rows of the data
# note: all values within a single column share one data type


In [31]:
# check the dimensions of the data


In [30]:
# check for any missing values


In [29]:
# perform some simple statistics on the data


In [28]:
# check the column names of the data since we have only six features


In [27]:
# check the data types of the features


In [26]:
# get more information about the data using the info method


## Subsetting Columns
- If we want to examine multiple columns, we can specify them by name, position, or range.

## Subsetting Columns by Name
- If we want to examine a single column from our data, we can access it by name using square brackets.

In [10]:
# Just get the country column and save it to its own variable
country_df = gap_df['country']

# show the first 5 rows
country_df.head()

0    Afghanistan
1    Afghanistan
2    Afghanistan
3    Afghanistan
4    Afghanistan
Name: country, dtype: object

In [11]:
# show the last 5 rows of the country column (a single column is a Series)
country_df.tail()


1699    Zimbabwe
1700    Zimbabwe
1701    Zimbabwe
1702    Zimbabwe
1703    Zimbabwe
Name: country, dtype: object

### Subsetting Multiple columns by name
- To specify multiple columns by the column name, we need to pass in a Python list between the square
brackets.

In [12]:
# looking at country, continent, and year
subset = gap_df[['country', 'continent', 'year']]

subset.head()

Unnamed: 0,country,continent,year
0,Afghanistan,Asia,1952
1,Afghanistan,Asia,1957
2,Afghanistan,Asia,1962
3,Afghanistan,Asia,1967
4,Afghanistan,Asia,1972


## Subsetting Rows
- Rows can be subset in multiple ways, by row label or by row position:
    - $loc$: subset based on the index label (row name).
    - $iloc$: subset based on the integer position (row number).


In [15]:
# get the first row of the original data
gap_df.loc[0]

# get the 100th row
gap_df.loc[99]

country      Bangladesh
continent          Asia
year               1967
lifeExp          43.453
pop            62821884
gdpPercap       721.186
Name: 99, dtype: object

## Short exercise discussion

In [18]:
# how can we get the last row from the dataframe by subsetting?

number_of_rows = gap_df.shape[0]
last_row_index = number_of_rows - 1

gap_df.loc[last_row_index]

country      Zimbabwe
continent      Africa
year             2007
lifeExp        43.487
pop          12311143
gdpPercap     469.709
Name: 1703, dtype: object

In [19]:
# Alternative way
gap_df.tail(1)

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
1703,Zimbabwe,Africa,2007,43.487,12311143,469.709298


## Subsetting Multiple Rows

In [21]:
# select the first, 100th, and 1000th rows
gap_df.loc[[0, 99, 999]]

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
0,Afghanistan,Asia,1952,28.801,8425333,779.445314
99,Bangladesh,Asia,1967,43.453,62821884,721.186086
999,Mongolia,Asia,1967,51.253,1149500,1226.04113


## Subsetting Rows by Row Number: $iloc$
- $iloc$ does the same thing as $loc$ but subsets by integer row position rather than by index label. In this example,
$iloc$ and $loc$ behave the same way for single rows and lists of rows, since the index labels are the row numbers.
However, keep in mind that the index labels do not necessarily have to be row numbers.
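
To see where the two accessors diverge, here is a minimal sketch with an index whose labels are not row numbers. The frame `labeled_df` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical frame whose index labels are NOT 0, 1, 2, ...
labeled_df = pd.DataFrame(
    {"score": [10, 20, 30]},
    index=[100, 200, 300],
)

by_label = labeled_df.loc[200]     # the row whose label is 200
by_position = labeled_df.iloc[1]   # the second row, whatever its label is

# labeled_df.loc[1] would raise a KeyError because no row is labeled 1
# labeled_df.iloc[200] would raise an IndexError because there is no 201st row
```

Both lookups return the same row here, but only because label 200 happens to sit at position 1.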

In [22]:
# get the 2nd row
gap_df.iloc[1]

country      Afghanistan
continent           Asia
year                1957
lifeExp           30.332
pop              9240934
gdpPercap        820.853
Name: 1, dtype: object

In [23]:
# get the last row of the data
gap_df.iloc[-1]

country      Zimbabwe
continent      Africa
year             2007
lifeExp        43.487
pop          12311143
gdpPercap     469.709
Name: 1703, dtype: object

In [24]:
# get the first, 100th, and 1000th rows
gap_df.iloc[[0, 99, 999]]

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
0,Afghanistan,Asia,1952,28.801,8425333,779.445314
99,Bangladesh,Asia,1967,43.453,62821884,721.186086
999,Mongolia,Asia,1967,51.253,1149500,1226.04113


## Mixing it up
- The general syntax for $loc$ and $iloc$ uses square brackets with a comma. The part to the left of 
the comma is the row values to subset; the part to the right of the comma is the column values to 
subset. That is, df.loc[[rows], [columns]] or df.iloc[[rows], [columns]].
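
The row-comma-column pattern can be sketched on a small hypothetical frame (`toy_df` and its values are made up). The sketch also shows one difference worth knowing: slices with $loc$ include the end label, while slices with $iloc$ exclude the end position.

```python
import pandas as pd

# Hypothetical toy frame with the default 0..n-1 index
toy_df = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "year": [1952, 1957, 1962, 1967],
    "pop": [10, 20, 30, 40],
})

# loc: labels on both sides of the comma
by_label = toy_df.loc[[0, 2], ["country", "pop"]]

# iloc: integer positions on both sides of the comma
by_position = toy_df.iloc[[0, 2], [0, 2]]

# Slices: loc includes the end label, iloc excludes the end position
loc_slice = toy_df.loc[0:2, "country":"year"]   # rows 0, 1, 2 and two columns
iloc_slice = toy_df.iloc[0:2, 0:2]              # rows 0, 1 and two columns
```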

In [25]:
# get all the rows for the year and pop columns
gap_df.loc[:, ["year", "pop"]]

# can you include the last column using "loc"?

Unnamed: 0,year,pop
0,1952,8425333
1,1957,9240934
2,1962,10267083
3,1967,11537966
4,1972,13079460
...,...,...
1699,1987,9216418
1700,1992,10704340
1701,1997,11404948
1702,2002,11926563


In [26]:
# use iloc to get all the rows for the year, pop, and the last columns
gap_df.iloc[:, [2, 4, -1]]

Unnamed: 0,year,pop,gdpPercap
0,1952,8425333,779.445314
1,1957,9240934,820.853030
2,1962,10267083,853.100710
3,1967,11537966,836.197138
4,1972,13079460,739.981106
...,...,...,...
1699,1987,9216418,706.157306
1700,1992,10704340,693.420786
1701,1997,11404948,792.449960
1702,2002,11926563,672.038623


In [25]:
# subsetting columns by Range
#gap_df.iloc[:, :3]

In [28]:
# subsetting multiple rows and columns
gap_df.iloc[[0, 99, 999], [0, 3, 5]]

Unnamed: 0,country,lifeExp,gdpPercap
0,Afghanistan,28.801,779.445314
99,Bangladesh,43.453,721.186086
999,Mongolia,51.253,1226.04113


## Grouped and Aggregated Calculations

In [31]:
# find the average gdpPercap for each continent
# first, group by the continent
# then find the average GDP per capita for each continent

gap_df.groupby('continent')['gdpPercap'].mean()

continent
Africa       2193.754578
Americas     7136.110356
Asia         7902.150428
Europe      14469.475533
Oceania     18621.609223
Name: gdpPercap, dtype: float64

In [4]:
# we can also calculate the average GDP and average life expectancy for each continent




In [24]:
# find the frequency of unique countries in each continent
#gap_df.groupby('continent')['country'].nunique()



In [23]:
# count of each country in each continent
#gap_df.groupby('continent')['country'].value_counts()
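
The grouped calculations in this section can be sketched on a small hypothetical frame. The data below is made up; the column names mirror the gapminder ones only for readability.

```python
import pandas as pd

# Hypothetical data: two continents, three countries
toy_df = pd.DataFrame({
    "continent": ["Asia", "Asia", "Europe", "Europe", "Europe"],
    "country": ["X", "X", "Y", "Z", "Z"],
    "lifeExp": [60.0, 62.0, 70.0, 74.0, 76.0],
    "gdpPercap": [1000.0, 1200.0, 9000.0, 10000.0, 11000.0],
})

# Average of two columns per continent: pass a list of column names
means = toy_df.groupby("continent")[["lifeExp", "gdpPercap"]].mean()

# Number of distinct countries per continent
n_countries = toy_df.groupby("continent")["country"].nunique()

# Number of rows for each country within each continent
country_counts = toy_df.groupby("continent")["country"].value_counts()
```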

## Data Assembly
- This involves combining various data sets together for analysis.

### Concatenation
- One of the easier ways to combine data is with concatenation. Concatenation can be thought of as appending a row or column to 
your data. This approach is useful if your data was split into parts or if you performed a calculation that you want to 
append to your existing data set. 
- Concatenation is accomplished by using the $concat$ function from pandas.

### Creating a DataFrame

In [2]:
d1 = {'A': ['a0', 'a1', 'a2', 'a3'], 'B': ['b0', 'b1', 'b2', 'b3'], 'C': ['c0', 'c1', 'c2', 'c3'], 'D': ['d0', 'd1', 'd2', 'd3']}
d2 = {'A': ['a4', 'a5', 'a6', 'a7'], 'B': ['b4', 'b5', 'b6', 'b7'], 'C': ['c4', 'c5', 'c6', 'c7'], 'D': ['d4', 'd5', 'd6', 'd7']}
d3 = {'A': ['a8', 'a9', 'a10', 'a11'], 'B': ['b8', 'b9', 'b10', 'b11'], 'C': ['c8', 'c9', 'c10', 'c11'], 'D': ['d8', 'd9', 'd10', 'd11']}

# creating dataframes 
df1 = pd.DataFrame(data = d1)
df2 = pd.DataFrame(data = d2)
df3 = pd.DataFrame(data = d3)

## Combining the dataframes together

In [22]:
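# combine the three dataframes row-wise
# (assumed definition: the cell that originally built df_comb is not shown)
df_comb = pd.concat([df1, df2, df3])

# subset the fourth row of the concatenated dataframe
df_comb.iloc[3]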

# what happens when you use "loc" to subset the new dataframe?



A    a3
B    b3
C    c3
D    d3
Name: 3, dtype: object

In [25]:
# how do we solve the issue above?
# hint: reset the index
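
One way to see the issue and the fix is the following sketch with two small hypothetical frames (`top` and `bottom` are made-up names). After a plain row-wise concatenation the original index labels repeat, so $loc$ can match more than one row.

```python
import pandas as pd

# Two hypothetical frames, each with the default index 0, 1
top = pd.DataFrame({"A": ["a0", "a1"], "B": ["b0", "b1"]})
bottom = pd.DataFrame({"A": ["a2", "a3"], "B": ["b2", "b3"]})

stacked = pd.concat([top, bottom])   # index is 0, 1, 0, 1
dup_rows = stacked.loc[1]            # label 1 matches two rows, so a DataFrame comes back
second_row = stacked.iloc[1]         # position 1 matches exactly one row, so a Series comes back

# Two ways to get a fresh 0..n-1 index
renumbered = stacked.reset_index(drop=True)
renumbered_too = pd.concat([top, bottom], ignore_index=True)
```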


## Using the "append" function
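- $concat$ is a general function that can concatenate multiple objects at once. If you just need to append a single object to an 
existing dataframe, older pandas versions offered the dataframe's $append$ method for that task. Note that $append$ was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions use $concat$ for this as well.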
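
Because `DataFrame.append` is no longer available in pandas 2.0 and later, a single object can be appended with $concat$ instead. The sketch below assumes a hypothetical frame `base` and a made-up new row.

```python
import pandas as pd

# Hypothetical existing frame
base = pd.DataFrame({"A": ["a0", "a1"], "B": ["b0", "b1"]})

# A single new row, wrapped in a one-row DataFrame so concat can align columns
new_row = pd.DataFrame([{"A": "a2", "B": "b2"}])

# ignore_index=True renumbers the rows 0..n-1 after appending
extended = pd.concat([base, new_row], ignore_index=True)
```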

In [22]:
# combine df1 and df2 


## Adding columns
- Concatenating columns is very similar to concatenating rows. The main difference is the axis parameter in the $concat$ function.
The default value of axis is 0, which concatenates data row-wise; passing axis = 1 concatenates column-wise instead.
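
A short sketch of the effect of the axis parameter, using two hypothetical frames with different column names (`left` and `right` are made up for illustration):

```python
import pandas as pd

left = pd.DataFrame({"A": ["a0", "a1"], "B": ["b0", "b1"]})
right = pd.DataFrame({"C": ["c0", "c1"], "D": ["d0", "d1"]})

# axis=0 (default): rows are stacked, and unmatched columns are filled with NaN
row_wise = pd.concat([left, right])

# axis=1: frames are placed side by side, aligned on the row index
col_wise = pd.concat([left, right], axis=1)

# with axis=1, ignore_index renumbers the column labels rather than the rows
renumbered_cols = pd.concat([left, right], axis=1, ignore_index=True)
```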

In [20]:
#col_concat = pd.concat([df1, df2, df3], axis = 1)


In [None]:
# we can also ignore index here



## Concatenating Rows with Different Columns

In [18]:
# let's modify our dataframes
df1.columns = ['A', 'B', 'C', 'D']
df2.columns = ['E', 'F', 'G', 'H']
df3.columns = ['A', 'C', 'F', 'H']

# examine df2


In [19]:
# concat the dataframes


## Resolving the NaN issues
- One way to avoid the inclusion of $NaN$ values is to keep only the columns shared by all the objects being concatenated, by passing join = 'inner' to $concat$.
- If the dataframes have columns in common, only the columns that all of them share are returned; if they share none, the result has no columns.
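
The difference between the default outer join and an inner join can be sketched with two hypothetical frames that share only column A (`first` and `second` are made-up names):

```python
import pandas as pd

first = pd.DataFrame({"A": ["a0", "a1"], "B": ["b0", "b1"]})
second = pd.DataFrame({"A": ["a2", "a3"], "C": ["c2", "c3"]})

# default join='outer': keeps every column and fills the gaps with NaN
outer = pd.concat([first, second], ignore_index=True)

# join='inner': keeps only the columns present in every frame
inner = pd.concat([first, second], join="inner", ignore_index=True)
```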

In [8]:
#pd.concat([df1, df2, df3], join = 'inner')


#pd.concat([df1, df3], join = 'inner') # we can also ignore the index

Unnamed: 0,A,C
0,a0,c0
1,a1,c1
2,a2,c2
3,a3,c3
0,a8,b8
1,a9,b9
2,a10,b10
3,a11,b11


## Missing Data
- Missing values are represented as NaN in pandas; in databases, they are NULL values.
- There is no single best approach to handling missing values; the choice usually requires adequate knowledge of the data at hand. 
However, sometimes we might need to fill some NAs with zero, or with the mean or median of the column, or drop the 
column with the missing values. Note that dropping a column is not ideal, as we might lose valuable information.
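
These options can be compared side by side on a small hypothetical frame. The values below are made up and only loosely modeled on the cars columns. Each call returns a new dataframe and leaves `toy_df` unchanged.

```python
import pandas as pd

nan = float("nan")

# Hypothetical data with one missing value in 'mpg' and one in 'disp'
toy_df = pd.DataFrame({
    "mpg": [21.0, nan, 22.8, 18.7],
    "disp": [160.0, 160.0, nan, 360.0],
    "hp": [110.0, 110.0, 93.0, 175.0],
})

missing_per_column = toy_df.isna().sum()        # count of NaN values per column

filled_zero = toy_df.fillna(0)                  # replace every NaN with zero
filled_median = toy_df.fillna(toy_df.median())  # replace NaN with each column's own median
dropped_cols = toy_df.dropna(axis=1)            # drop columns that contain any NaN
dropped_rows = toy_df.dropna()                  # drop rows that contain any NaN
```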

In [16]:
cars = pd.read_csv('data/cars.csv')
cars.head()

# rename the 'Unnamed column'
#cars.rename(columns = {'Unnamed: 0': 'car_features'}, inplace = True)


Unnamed: 0,car_features,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
0,Mazda RX4,21.0,6.0,160.0,110.0,3.9,2.62,16.46,0.0,1.0,4.0,4.0
1,Mazda RX4 Wag,21.0,6.0,160.0,110.0,3.9,2.875,17.02,0.0,1.0,4.0,4.0
2,Datsun 710,22.8,4.0,108.0,93.0,3.85,2.32,18.61,1.0,1.0,4.0,1.0
3,Hornet 4 Drive,21.4,6.0,258.0,110.0,3.08,3.215,19.44,1.0,0.0,3.0,1.0
4,Hornet Sportabout,18.7,8.0,360.0,175.0,3.15,3.44,17.02,0.0,0.0,3.0,2.0


In [None]:
# check out the information in the data

In [None]:
# find the missing values in the car data

## Filling missing values

In [None]:
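# suppose we want to fill the NAs of the 'disp' variable with zero

# assign the result back to the column: calling fillna(inplace = True) on a
# selected column may not update cars itself in recent pandas versions
cars['disp'] = cars['disp'].fillna(value = 0)
cars.isna().sum()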

In [None]:
# fill the NAs for a variable (e.g. mpg) with its mean

ave = cars['mpg'].mean()
# assign the filled column back rather than relying on inplace = True
cars['mpg'] = cars['mpg'].fillna(value = ave)


In [None]:
# filling the missing values for multiple variables

values = {'mpg': 0, 'disp': cars['disp'].mean()}
cars.fillna(value = values, inplace = True)


## Exercise 1
- Using the $gapminder$ data: for each year in our data, what is the average life expectancy, average population, 
and average GDP?
- Find the average life expectancy and average GDP for $Americas$ for the year 2002.

## Exercise 2
- Using a subset of the $COVID-19$ data: C:\Users\ajayi\OneDrive\Desktop\Class_Data\Data\covid_subset.txt
    - load the data
    - find the dimensions of the data
    - select the following variables and save them in a new dataframe: continent, locations, total_cases, total_deaths, 
        gdp_per_capita, tests_per_case, life_expectancy,female_smokers, male_smokers, diabetes_prevalence.
    - Find the missing values in the selected variables.
    - Devise an appropriate approach, with a reasonable explanation, for handling the missing values.
- Find the continent with the highest total cases and total deaths of coronavirus.
- Which country has the highest number of deaths?