# Pandas introduction

This notebook is a quick introduction to the NumPy and pandas libraries, intended as a reference for the most common operations.

First, we import `pandas` under its standard alias `pd`, together with NumPy as `np` :)

In [2]:
```python
import numpy as np
import pandas as pd
```

# <center>Pandas</center>

<center><img src=https://c.tenor.com/tIcg38r9_LMAAAAC/hi-hello.gif></center>

## <center>Basic loading</center>

### Task 1 (1 point)

Load the food facts dataframe from [Kaggle](https://www.kaggle.com/openfoodfacts/world-food-facts/data).

In [2]:
food = pd.read_csv('./data/en.openfoodfacts.org.products.tsv', sep='\t')


#### a) Check the first and last 5 elements 

#### b) Figure out the number of rows and columns

#### c) Print summary information about the dataframe

#### d) Show the types of the data

#### e) How is the data indexed?
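The inspection steps in a) to e) map onto a handful of standard DataFrame attributes and methods. A minimal sketch on a small made-up frame (the `toy` name, its columns, and its values are invented for illustration and are not taken from the food dataset):

```python
import pandas as pd

# Toy stand-in for the food dataframe; column names and values are made up.
toy = pd.DataFrame({
    "product_name": ["apple juice", "oat bar", "yogurt", "rye bread", "cola", "granola"],
    "sugars_100g": [9.6, 22.0, 4.7, 3.1, 10.6, 18.5],
    "countries": ["France", "US", "France", "Germany", "US", "UK"],
})

first_rows = toy.head()      # a) first 5 rows
last_rows = toy.tail()       # a) last 5 rows
n_rows, n_cols = toy.shape   # b) (number of rows, number of columns)
toy.info()                   # c) prints column names, non-null counts and dtypes
column_types = toy.dtypes    # d) one dtype per column
row_index = toy.index        # e) a RangeIndex when no index column was set
```

The same calls should work unchanged on `food` once it is loaded.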

### Task 2 (2 points)

We use the data from [GitHub](https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user).

a) Load the dataframe about the users

b) Change the columns to capital letters

c) Print the occupation and gender of the users

d) How many unique occupations are there?

e) Summarize the information about the users

f) What is the mean age?

g) What is the occupation with the fewest occurrences?
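The operations these sub-tasks call for can be sketched on a toy frame. The rows below are invented; only the column names mirror the users data, and `toy_users` is a hypothetical name:

```python
import pandas as pd

# Toy stand-in for the users dataframe (same column names, invented rows).
toy_users = pd.DataFrame(
    {
        "age": [24, 53, 23, 33, 41],
        "gender": ["M", "F", "M", "F", "M"],
        "occupation": ["technician", "other", "writer", "other", "technician"],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="user_id"),
)

toy_users.columns = toy_users.columns.str.upper()         # b) upper-case column names
subset = toy_users[["OCCUPATION", "GENDER"]]               # c) select two columns
n_occupations = toy_users["OCCUPATION"].nunique()          # d) number of distinct values
summary = toy_users.describe(include="all")                # e) summary statistics
mean_age = toy_users["AGE"].mean()                         # f) average age
rarest = toy_users["OCCUPATION"].value_counts().idxmin()   # g) label with the lowest count
```

When several occupations share the lowest count, `idxmin` reports only one of them, so inspecting `value_counts()` directly may be safer on the real data.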

#### a)

In [17]:
users = pd.read_csv('https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user', 
                      sep='|', index_col='user_id')
users

| user_id | age | gender | occupation | zip_code |
|---|---|---|---|---|
| 1 | 24 | M | technician | 85711 |
| 2 | 53 | F | other | 94043 |
| 3 | 23 | M | writer | 32067 |
| 4 | 24 | M | technician | 43537 |
| 5 | 33 | F | other | 15213 |
| ... | ... | ... | ... | ... |
| 939 | 26 | F | student | 33319 |
| 940 | 32 | M | administrator | 02215 |
| 941 | 20 | M | student | 97229 |
| 942 | 48 | F | librarian | 78209 |


#### b)

#### c)

#### d)

#### e)

#### f)

#### g)

## <center>Filtering and sorting data</center>

### Task 1 (1 point)

#### <center>Otter</center>
<center><img src = https://www.otterspecialistgroup.org/osg-newsite/wp-content/uploads/2017/04/ThinkstockPhotos-827261360.jpg width=160 height=160></center>

In [23]:
users = pd.read_csv('https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user', 
                      sep='|', index_col='user_id')
users

| user_id | age | gender | occupation | zip_code |
|---|---|---|---|---|
| 1 | 24 | M | technician | 85711 |
| 2 | 53 | F | other | 94043 |
| 3 | 23 | M | writer | 32067 |
| 4 | 24 | M | technician | 43537 |
| 5 | 33 | F | other | 15213 |
| ... | ... | ... | ... | ... |
| 939 | 26 | F | student | 33319 |
| 940 | 32 | M | administrator | 02215 |
| 941 | 20 | M | student | 97229 |
| 942 | 48 | F | librarian | 78209 |


#### a) Sort users by occupation

#### b) What is the most common zip code?

#### c) What is the most common zip code for people over 30?

#### d) What is the most common zip code for women?
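Sorting and "most common value under a condition" questions usually combine `sort_values`, a boolean mask, and `value_counts`. A minimal sketch on invented rows (only the column names mirror the users data):

```python
import pandas as pd

# Invented rows; zip codes are kept as strings so leading zeros survive.
toy_users = pd.DataFrame({
    "age": [24, 53, 23, 33, 41, 29, 27],
    "gender": ["M", "F", "M", "F", "F", "M", "M"],
    "occupation": ["technician", "other", "writer", "other", "artist", "writer", "student"],
    "zip_code": ["85711", "94043", "85711", "94043", "94043", "85711", "85711"],
})

by_occupation = toy_users.sort_values("occupation")           # a) alphabetical sort

top_zip = toy_users["zip_code"].value_counts().idxmax()       # b) most frequent value

over_30 = toy_users[toy_users["age"] > 30]                    # c) filter, then count
top_zip_over_30 = over_30["zip_code"].value_counts().idxmax()

women = toy_users[toy_users["gender"] == "F"]                 # d) same pattern, other mask
top_zip_women = women["zip_code"].value_counts().idxmax()
```

`Series.mode()` is an alternative that returns every tied value instead of a single one.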

### Task 2 (1.5 points)

#### We'll use the previous dataframe and the one from the [Chipotle Exercises Video](https://github.com/justmarkham) Tutorial, where you can watch someone else go through the exercises.

In [5]:
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv'
chipo = pd.read_csv(url, sep = '\t')
chipo

# Set your index
your_idx = 236490

#### a) Convert the item_price column to a numeric type (prices are in USD)

#### b) Drop duplicate rows, treating quantity, item_name and choice_description as the key columns

#### c) Select the rows with a quantity of one

#### d) Find the products that cost more than `int(str(your_idx)[0:2])/2` USD
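A minimal sketch of these four steps on invented order rows. It assumes prices are stored as strings such as `"$8.49 "`, reads step c) as keeping single-quantity rows, and converts `your_idx` to a string before slicing because an integer cannot be sliced; all three are assumptions, and `toy_orders` is a hypothetical name:

```python
import pandas as pd

# Invented rows; the "$8.49 " string format for prices is an assumption.
toy_orders = pd.DataFrame({
    "quantity": [1, 1, 2, 1],
    "item_name": ["Chicken Bowl", "Chicken Bowl", "Chips", "Steak Burrito"],
    "choice_description": ["[Rice]", "[Rice]", None, "[Beans]"],
    "item_price": ["$8.49 ", "$8.49 ", "$4.30 ", "$11.75 "],
})

# a) strip the dollar sign and surrounding whitespace, then convert to float
toy_orders["item_price"] = (
    toy_orders["item_price"].str.replace("$", "", regex=False).str.strip().astype(float)
)

# b) drop rows repeating the same (quantity, item_name, choice_description) combination
deduped = toy_orders.drop_duplicates(subset=["quantity", "item_name", "choice_description"])

# c) keep only rows with a quantity of one (one reading of the task)
single = deduped[deduped["quantity"] == 1]

# d) threshold built from the first two digits of the index number
your_idx = 236490
threshold = int(str(your_idx)[0:2]) / 2   # 23 / 2
expensive = single[single["item_price"] > threshold]
```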

## <center>Grouping and others</center>

### Task 1 (2 points)

For the next set of questions, we will be using census data from the [United States Census Bureau](http://www.census.gov). Counties are political and geographic subdivisions of states in the United States. This dataset contains population data for counties and states in the US from 2010 to 2015. [See this document](https://www2.census.gov/programs-surveys/popest/technical-documentation/file-layouts/2010-2015/co-est2015-alldata.pdf) for a description of the variable names. (Credit: University of Michigan)

The census dataset (census.csv) should be loaded as census_df. Answer questions using this as appropriate.

In [189]:
census_df = pd.read_csv('./data/census.csv')
census_df.head()

|  | SUMLEV | REGION | DIVISION | STATE | COUNTY | STNAME | CTYNAME | CENSUS2010POP | ESTIMATESBASE2010 | POPESTIMATE2010 | ... | RDOMESTICMIG2011 | RDOMESTICMIG2012 | RDOMESTICMIG2013 | RDOMESTICMIG2014 | RDOMESTICMIG2015 | RNETMIG2011 | RNETMIG2012 | RNETMIG2013 | RNETMIG2014 | RNETMIG2015 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | 3 | 6 | 1 | 0 | Alabama | Alabama | 4779736 | 4780127 | 4785161 | ... | 0.002295 | -0.193196 | 0.381066 | 0.582002 | -0.467369 | 1.030015 | 0.826644 | 1.383282 | 1.724718 | 0.712594 |
| 1 | 50 | 3 | 6 | 1 | 1 | Alabama | Autauga County | 54571 | 54571 | 54660 | ... | 7.242091 | -2.915927 | -3.012349 | 2.265971 | -2.530799 | 7.606016 | -2.626146 | -2.722002 | 2.59227 | -2.187333 |
| 2 | 50 | 3 | 6 | 1 | 3 | Alabama | Baldwin County | 182265 | 182265 | 183193 | ... | 14.83296 | 17.647293 | 21.845705 | 19.243287 | 17.197872 | 15.844176 | 18.559627 | 22.727626 | 20.317142 | 18.293499 |
| 3 | 50 | 3 | 6 | 1 | 5 | Alabama | Barbour County | 27457 | 27457 | 27341 | ... | -4.728132 | -2.50069 | -7.056824 | -3.904217 | -10.543299 | -4.874741 | -2.758113 | -7.167664 | -3.978583 | -10.543299 |
| 4 | 50 | 3 | 6 | 1 | 7 | Alabama | Bibb County | 22915 | 22919 | 22861 | ... | -5.527043 | -5.068871 | -6.201001 | -0.177537 | 0.177258 | -5.088389 | -4.363636 | -5.403729 | 0.754533 | 1.107861 |


#### a) Which state has the most counties in it?

#### b) **Only looking at the two most populous counties for each state**, what are the five most populous states (in order of highest population to lowest population)? Use `CENSUS2010POP`.

*This function should return a list of string values.*

#### c) Which county has had the largest absolute change in population within the period 2010-2014?
e.g. if the yearly population estimates of a county are 100, 120, 80, 105, 100, 130, then its largest change in the period would be |130 - 80| = 50.
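A minimal sketch of the grouping patterns involved, on an invented miniature of the census table. It assumes, as the `head()` output suggests, that `SUMLEV == 40` marks state-level rows that must be filtered out before counting counties, and it reads the largest change as the maximum minus the minimum of the yearly estimates; only three estimate columns are included to keep it short:

```python
import pandas as pd

# Invented miniature of the census table (two states, five counties).
toy_census = pd.DataFrame({
    "SUMLEV": [40, 50, 50, 50, 40, 50, 50],
    "STNAME": ["A", "A", "A", "A", "B", "B", "B"],
    "CTYNAME": ["A", "A1", "A2", "A3", "B", "B1", "B2"],
    "CENSUS2010POP": [600, 300, 200, 100, 700, 400, 300],
    "POPESTIMATE2010": [600, 300, 200, 100, 700, 400, 300],
    "POPESTIMATE2011": [570, 320, 150, 100, 700, 410, 290],
    "POPESTIMATE2012": [625, 310, 210, 105, 735, 405, 330],
})

# Keep county rows only (assumption: SUMLEV 50 = county, 40 = state total).
counties = toy_census[toy_census["SUMLEV"] == 50]

# Count rows per group: here, counties per state.
most_counties = counties.groupby("STNAME")["CTYNAME"].count().idxmax()

# Top two counties per state, summed, then the largest states first.
top_two = counties.sort_values("CENSUS2010POP", ascending=False).groupby("STNAME").head(2)
top_states = top_two.groupby("STNAME")["CENSUS2010POP"].sum().nlargest(5).index.tolist()

# Largest absolute change per county: row-wise max minus row-wise min.
pop_cols = ["POPESTIMATE2010", "POPESTIMATE2011", "POPESTIMATE2012"]
change = counties[pop_cols].max(axis=1) - counties[pop_cols].min(axis=1)
biggest_change = counties.loc[change.idxmax(), "CTYNAME"]
```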

### Task 2 (1.5 points)

In the census data file, each row is assigned to a region by the *REGION* column.

Create a query that finds the counties that belong to regions 1 or 3, whose name starts with 'W', and whose Y was greater than their POPESTIMATE2014, where Y is the column named `'POPESTIMATE201' + str(int(str(your_idx)[-1]) % 5)`.

*This function should return a DataFrame with the columns = ['STNAME', 'CTYNAME', Y, 'POPESTIMATE2014'] and the same index ID as the census_df (sorted ascending by index).*
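One way to build such a query is to combine boolean masks with `&`. The sketch below uses invented rows and assumes the last digit of `your_idx` is extracted via `str(...)`, since an integer cannot be indexed directly:

```python
import pandas as pd

# Invented rows; only the column names mirror the census data.
toy_census = pd.DataFrame({
    "REGION": [1, 2, 3, 3, 1],
    "STNAME": ["X", "Y", "Z", "Z", "X"],
    "CTYNAME": ["Washington County", "Wayne County", "Walker County",
                "Adams County", "Warren County"],
    "POPESTIMATE2010": [120, 500, 90, 300, 80],
    "POPESTIMATE2014": [100, 400, 95, 200, 70],
})

your_idx = 236490
# Last digit modulo 5 selects the year column (assumption: convert via str first).
y_col = "POPESTIMATE201" + str(int(str(your_idx)[-1]) % 5)

mask = (
    toy_census["REGION"].isin([1, 3])                          # region 1 or 3
    & toy_census["CTYNAME"].str.startswith("W")                # name starts with 'W'
    & (toy_census[y_col] > toy_census["POPESTIMATE2014"])      # Y above the 2014 estimate
)
result = toy_census.loc[mask, ["STNAME", "CTYNAME", y_col, "POPESTIMATE2014"]].sort_index()
```

Each condition needs its own parentheses where a comparison operator is involved, because `&` binds more tightly than `>`.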

### Task 3 (2 points)

In [27]:
users = pd.read_table('https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user', 
                      sep='|', index_col='user_id')
users

| user_id | age | gender | occupation | zip_code |
|---|---|---|---|---|
| 1 | 24 | M | technician | 85711 |
| 2 | 53 | F | other | 94043 |
| 3 | 23 | M | writer | 32067 |
| 4 | 24 | M | technician | 43537 |
| 5 | 33 | F | other | 15213 |
| ... | ... | ... | ... | ... |
| 939 | 26 | F | student | 33319 |
| 940 | 32 | M | administrator | 02215 |
| 941 | 20 | M | student | 97229 |
| 942 | 48 | F | librarian | 78209 |


#### a) Calculate the mean age per occupation

#### b) Find the most common gender per 5-year age interval

#### c) Find the female ratio per occupation and sort it from highest to lowest

#### d) Calculate minimum and maximum ages for each occupation

#### e) For each occupation, present the percentage of women and men
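These questions are all variations on `groupby`. A minimal sketch on invented rows; the 5-year bin edges starting at 20 and the reading of "winning" as "most frequent" are assumptions made for this example:

```python
import pandas as pd

# Invented rows; only the column names mirror the users data.
toy_users = pd.DataFrame({
    "age": [21, 24, 22, 33, 38, 36, 39],
    "gender": ["M", "F", "F", "M", "F", "M", "F"],
    "occupation": ["student", "student", "writer", "writer", "writer", "student", "student"],
})

# a) mean age per occupation
mean_age = toy_users.groupby("occupation")["age"].mean()

# b) most frequent gender per 5-year age bin: [20, 25), [25, 30), ...
age_bin = pd.cut(toy_users["age"], bins=list(range(20, 45, 5)), right=False)
winning = toy_users.groupby(age_bin, observed=True)["gender"].agg(
    lambda s: s.value_counts().idxmax()
)

# c) share of women per occupation, highest first
is_female = toy_users["gender"].eq("F")
female_ratio = is_female.groupby(toy_users["occupation"]).mean().sort_values(ascending=False)

# d) minimum and maximum age per occupation
age_range = toy_users.groupby("occupation")["age"].agg(["min", "max"])

# e) percentage of each gender within every occupation (rows sum to 100)
gender_pct = pd.crosstab(toy_users["occupation"], toy_users["gender"], normalize="index") * 100
```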

## <center>Merging data</center>

In [46]:
# fun
s1 = pd.Series(np.random.randint(1, high=12, size=100, dtype='l'))
s2 = pd.Series(np.random.binomial(123, 0.3, 1000))
s3 = pd.Series(np.random.randint(10000, high=30001, size=100, dtype='l'))

s1

0      7
1      3
2      5
3      6
4     11
      ..
95     5
96     1
97     2
98     3
99     4
Length: 100, dtype: int32

In [47]:
s2

0      25
1      33
2      34
3      33
4      35
       ..
995    27
996    29
997    32
998    41
999    42
Length: 1000, dtype: int32

In [48]:
s3

0     10073
1     13750
2     15179
3     12980
4     26735
      ...  
95    28679
96    26132
97    18600
98    12453
99    15683
Length: 100, dtype: int32

### Task 1 (1.5 points)

#### a) Join all of the above Series into a DataFrame, one Series per column

#### b) Fill the NaN values with a value from 1 to 9 in the first column and with a big integer in the last one

#### c) Change the name of the columns to `floors`, `security workers` and `electricity bill`

#### d) Create a one-column DataFrame with the values of the 3 Series and assign it to 'bigcolumn'. Assert that the index is correct.
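A minimal sketch of the concatenation steps with much shorter Series. The names `a`, `b`, `c`, the fill values 5 and 10**6, and the column name `value` are all choices made for this example, not part of the task:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = pd.Series(rng.integers(1, 12, size=4))          # short Series
b = pd.Series(rng.binomial(123, 0.3, size=6))       # longer Series
c = pd.Series(rng.integers(10000, 30001, size=4))   # short Series

# a) axis=1 aligns on the index; the shorter Series are padded with NaN
frame = pd.concat([a, b, c], axis=1)

# b) per-column fill values, keyed by the default column labels 0 and 2
filled = frame.fillna({0: 5, 2: 10**6})

# c) rename the columns
filled.columns = ["floors", "security workers", "electricity bill"]

# d) stack the three Series end to end with a fresh 0..n-1 index
bigcolumn = pd.concat([a, b, c], ignore_index=True).to_frame(name="value")
assert bigcolumn.index.equals(pd.RangeIndex(len(a) + len(b) + len(c)))
```

Without `ignore_index=True` the stacked index would repeat the labels 0, 1, 2, ... of each Series, which is what the final assertion guards against.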

## <center>Apply</center>

These exercises come from [GitHub](https://github.com/guipsamora); refer to that repository for credit.

In [6]:
csv_url = 'https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/04_Apply/Students_Alcohol_Consumption/student-mat.csv'
df = pd.read_csv(csv_url)
df.head()

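|  | school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | ... | famrel | freetime | goout | Dalc | Walc | health | absences | G1 | G2 | G3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GP | F | 18 | U | GT3 | A | 4 | 4 | at_home | teacher | ... | 4 | 3 | 4 | 1 | 1 | 3 | 6 | 5 | 6 | 6 |
| 1 | GP | F | 17 | U | GT3 | T | 1 | 1 | at_home | other | ... | 5 | 3 | 3 | 1 | 1 | 3 | 4 | 5 | 5 | 6 |
| 2 | GP | F | 15 | U | LE3 | T | 1 | 1 | at_home | other | ... | 4 | 3 | 2 | 2 | 3 | 3 | 10 | 7 | 8 | 10 |
| 3 | GP | F | 15 | U | GT3 | T | 4 | 2 | health | services | ... | 3 | 2 | 2 | 1 | 1 | 5 | 2 | 15 | 14 | 15 |
| 4 | GP | F | 16 | U | GT3 | T | 3 | 3 | other | other | ... | 4 | 3 | 2 | 1 | 2 | 5 | 4 | 6 | 10 | 10 |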


In [7]:
df.describe()

|  | age | Medu | Fedu | traveltime | studytime | failures | famrel | freetime | goout | Dalc | Walc | health | absences | G1 | G2 | G3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 |
| mean | 16.696203 | 2.749367 | 2.521519 | 1.448101 | 2.035443 | 0.334177 | 3.944304 | 3.235443 | 3.108861 | 1.481013 | 2.291139 | 3.55443 | 5.708861 | 10.908861 | 10.713924 | 10.41519 |
| std | 1.276043 | 1.094735 | 1.088201 | 0.697505 | 0.83924 | 0.743651 | 0.896659 | 0.998862 | 1.113278 | 0.890741 | 1.287897 | 1.390303 | 8.003096 | 3.319195 | 3.761505 | 4.581443 |
| min | 15.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 3.0 | 0.0 | 0.0 |
| 25% | 16.0 | 2.0 | 2.0 | 1.0 | 1.0 | 0.0 | 4.0 | 3.0 | 2.0 | 1.0 | 1.0 | 3.0 | 0.0 | 8.0 | 9.0 | 8.0 |
| 50% | 17.0 | 3.0 | 2.0 | 1.0 | 2.0 | 0.0 | 4.0 | 3.0 | 3.0 | 1.0 | 2.0 | 4.0 | 4.0 | 11.0 | 11.0 | 11.0 |
| 75% | 18.0 | 4.0 | 3.0 | 2.0 | 2.0 | 0.0 | 5.0 | 4.0 | 4.0 | 2.0 | 3.0 | 5.0 | 8.0 | 13.0 | 13.0 | 14.0 |
| max | 22.0 | 4.0 | 4.0 | 4.0 | 4.0 | 3.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 75.0 | 19.0 | 19.0 | 20.0 |


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 395 entries, 0 to 394
Data columns (total 33 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   school      395 non-null    object
 1   sex         395 non-null    object
 2   age         395 non-null    int64 
 3   address     395 non-null    object
 4   famsize     395 non-null    object
 5   Pstatus     395 non-null    object
 6   Medu        395 non-null    int64 
 7   Fedu        395 non-null    int64 
 8   Mjob        395 non-null    object
 9   Fjob        395 non-null    object
 10  reason      395 non-null    object
 11  guardian    395 non-null    object
 12  traveltime  395 non-null    int64 
 13  studytime   395 non-null    int64 
 14  failures    395 non-null    int64 
 15  schoolsup   395 non-null    object
 16  famsup      395 non-null    object
 17  paid        395 non-null    object
 18  activities  395 non-null    object
 19  nursery     395 non-null    object
 20  higher    

In [9]:
stud_alcoh = df.loc[: , "school":"guardian"]
stud_alcoh.head()

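|  | school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | reason | guardian |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GP | F | 18 | U | GT3 | A | 4 | 4 | at_home | teacher | course | mother |
| 1 | GP | F | 17 | U | GT3 | T | 1 | 1 | at_home | other | course | father |
| 2 | GP | F | 15 | U | LE3 | T | 1 | 1 | at_home | other | other | mother |
| 3 | GP | F | 15 | U | GT3 | T | 4 | 2 | health | services | home | mother |
| 4 | GP | F | 16 | U | GT3 | T | 3 | 3 | other | other | home | father |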


### Task 1 (0.5 point)

#### Apply a lambda function to capitalize strings.

### Task 2 (0.5 point)

#### a) Create a function called `majority` that returns a boolean value, and use it to fill a new column called `legal_drinker` (consider majority as older than 15 years old)
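Both Apply tasks follow the same pattern: pass a function to `Series.apply` and store the result in a column. A minimal sketch on invented rows; the choice of `Mjob` and `Fjob` as the columns to capitalize is an assumption, as are the names `toy_students` and `capitalizer`:

```python
import pandas as pd

# Invented rows; only the column names mirror the student data.
toy_students = pd.DataFrame({
    "age": [15, 16, 18],
    "Mjob": ["at_home", "health", "other"],
    "Fjob": ["teacher", "services", "other"],
})

# Lambda that upper-cases the first character of a string.
capitalizer = lambda text: text.capitalize()
toy_students["Mjob"] = toy_students["Mjob"].apply(capitalizer)
toy_students["Fjob"] = toy_students["Fjob"].apply(capitalizer)


def majority(age):
    """Return True when the age is above the threshold stated in the task (15)."""
    return age > 15


# apply() calls majority once per row value and collects the booleans.
toy_students["legal_drinker"] = toy_students["age"].apply(majority)
```

For a simple comparison like this, the vectorized form `toy_students["age"] > 15` gives the same column without calling a Python function per row.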

# <center>That's all folks</center>
<center><img src = https://acegif.com/wp-content/uploads/gif/panda-8.gif></center>