# III. The Series Data Structure
A Series is similar to a list or an array in Python. It represents a series of values (numeric or otherwise) such as a column of data. A typical series has this form: <font color="red"><b>series = {index-0: value-0, index-1: value-1, index-2: value-2}</b></font>

To get started using a Series, the pandas library needs to be imported into the Python environment.

In [1]:
import pandas as pd

A series can be initialized based on a list or a dictionary. The differences between the two methods are straightforward.<br/> (1) If the series is initialized from a list, its index will be the numerical values 0,1,2, etc by default.<br/> (2) If the series is initialized from a dictionary, the keys of the dictionary will be mapped to the series' indices.

## 3.1 Initialize series from a list
Based on an existing list, a series can be easily initialized by <font color="red"><b>series = pd.Series(list, index=[list of indices])</b></font>. Like a list, a series can hold data of different types. Note that the <b>index</b> parameter is optional. If it is not specified, the indices default to 0, 1, 2, etc.

#### Question 1
Initialize a series with the following list. The resulting series carries numerical data.<br/>
`numbers = [1, 2, 3]`

In [3]:
numbers = [1, 2, 3]
series1 = pd.Series(numbers)
series1

0    1
1    2
2    3
dtype: int64

#### Question 2
Initialize a series with the following data and indices.<br/>
`data = ['Tiger', 'Bear', 'Moose'], index = ['India', 'America', 'Canada']`

In [6]:
index=['India', 'America', 'Canada']
series2 = pd.Series(['Tiger', 'Bear', 'Moose'], index = index)
series2

India      Tiger
America     Bear
Canada     Moose
dtype: object

## 3.2 Initialize series from a dictionary
1. A series can also be initialized from an existing dictionary by <font color="red"><b>series = pd.Series(dict)</b></font>.<br/>
2. The indices of the series can be accessed through the <font color="red"><b>series.index</b></font> attribute.

#### Question 3 
Initialize a series with the following dictionary that maps sports to countries.<br/>
`'Archery': 'Bhutan',`
`'Golf': 'Scotland',`
`'Sumo': 'Japan',`
`'Taekwondo': 'South Korea'`

In [11]:
series3 = pd.Series({'Archery': 'Bhutan', 'Golf': 'Scotland', 'Sumo': 'Japan', 'Taekwondo': 'South Korea'})
series3

Archery           Bhutan
Golf            Scotland
Sumo               Japan
Taekwondo    South Korea
dtype: object

In [12]:
series3.index

Index(['Archery', 'Golf', 'Sumo', 'Taekwondo'], dtype='object')

## 3.3 Access series data
The data <b>values</b> in a series can be accessed by numerical index (position) or by actual index (label).<br/>
1. By numerical index: <font color="red"><b>series.iloc[numerical_id]</b></font>. The numerical indices start from 0.<br/>
2. By actual index: <font color="red"><b>series.loc[actual_id]</b></font>.<br/>
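
The difference between the two accessors is easiest to see when the index labels are themselves integers, because a label and a position can then disagree. The following is a minimal sketch; the series `scores` and its labels are made up for illustration and are not part of the exercises.

```python
import pandas as pd

# Hypothetical series whose integer labels do not match the positions.
scores = pd.Series([90, 75, 60], index=[10, 20, 30])

first_by_position = scores.iloc[0]   # position 0
by_label = scores.loc[20]            # label 20

# Slices differ too: iloc excludes the stop position, loc includes the stop label.
positional_slice = scores.iloc[0:2]  # positions 0 and 1
label_slice = scores.loc[10:20]      # labels 10 and 20, both included

print(first_by_position, by_label)
print(positional_slice.tolist(), label_slice.tolist())
```

Using `.iloc` and `.loc` explicitly, rather than plain square brackets, keeps it clear whether a position or a label is meant.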

#### Question 4
From the series created in Question 3, extract the corresponding country data of any sport by its numerical index or actual index.

In [22]:
series3.iloc[0]

'Bhutan'

In [24]:
series3.loc['Sumo']

'Japan'

## 3.4 Append new entries to series
1. Insert a single entry to an existing series: <font color="red"><b>series.loc[index] = value</b></font>.<br/>
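2. Insert multiple entries to an existing series: <font color="red"><b>series1.append(series2)</b></font>. A new series needs to be initialized first to store the entries to be inserted. The call returns a new series and leaves <b>series1</b> unchanged. Note that <b>Series.append</b> was removed in pandas 2.0; in newer versions, use <b>pd.concat([series1, series2])</b> instead.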
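
Both ways of adding entries can be sketched as follows. The series `capitals` and `more` are hypothetical, and `pd.concat` is used in place of `append` so that the sketch also runs on pandas 2.0 and later.

```python
import pandas as pd

# Hypothetical data for illustration only.
capitals = pd.Series({'France': 'Paris', 'Japan': 'Tokyo'})

# Single entry: assigning to a label that does not exist yet enlarges the series in place.
capitals.loc['Kenya'] = 'Nairobi'

# Multiple entries: build a second series, then concatenate.
# pd.concat returns a new series and leaves both inputs unchanged.
more = pd.Series({'Peru': 'Lima', 'Norway': 'Oslo'})
combined = pd.concat([capitals, more])

print(combined)
print(len(capitals), len(combined))
```

Because the result is a new series, it has to be assigned to a variable if it is needed later.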

#### Question 5
Insert the following list of sports into the series created in Question 3. <b>Hint</b>: use the append function (or <b>pd.concat</b> in pandas 2.0 and later).<br/>
`'Soccer': 'Brazil',`
`'Table Tennis': 'China',` 
`'Swimming': 'United States'`

In [25]:
add_series = pd.Series({'Soccer': 'Brazil', 'Table Tennis': 'China',  'Swimming': 'United States'})
# Series.append was removed in pandas 2.0; on newer versions use pd.concat([series3, add_series]).
series3.append(add_series)

Archery                Bhutan
Golf                 Scotland
Sumo                    Japan
Taekwondo         South Korea
Soccer                 Brazil
Swimming        United States
Table Tennis            China
dtype: object

## 3.5 Miscellaneous series applications
1. Sum up a numerical series with the numpy library: <font color="red"><b>total = np.sum(series)</b></font>
2. Find the length of a series: <font color="red"><b>len(series)</b></font>.
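
A short sketch of these two operations on a made-up series (`values` is a hypothetical name), which also shows how the sum and the length relate to the mean:

```python
import numpy as np
import pandas as pd

values = pd.Series([4, 8, 15])  # hypothetical numbers

total = np.sum(values)      # same result as values.sum()
count = len(values)         # number of entries
average = total / count     # same result as values.mean()

print(total, count, average)
```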

#### Question 6
Create a series with the following list of numbers. Add 2 to each number in the series and calculate the average value.<br/>
`[100, 120, 101, 3]`

In [27]:
num = pd.Series([100, 120, 101, 3])
num += 2
num

0    102
1    122
2    103
3      5
dtype: int64

In [30]:
import numpy as np
np.mean(num)

83.0

# IV. The DataFrame Data Structure
The DataFrame is the most important data structure in analytics projects. In most cases, data imported from external spreadsheets takes the form of a dataframe. A dataframe can be thought of as a collection of many series. Therefore, the series manipulation techniques described above also apply to dataframes.

## 4.1 Initialize dataframe and access data in dataframe
Dataframes can be initialized with the <b>pd.DataFrame(data, index, <mark style="background-color: yellow;">columns</mark>)</b> command. Compared to the initialization of series, a list of column names needs to be provided to fully define a dataframe.
1. Initialize dataframe from series: <font color="red"><b>df = pd.DataFrame([list_of_series], index = [list_of_indices])</b></font>. The series indices are taken as column names of the dataframe by default.
2. Without providing the "data" argument, an empty dataframe can be declared first for storing data in later use.
3. Access row data: <font color="red"><b>df.iloc[numerical_row_id, :]</b></font> or <font color="red"><b>df.loc[actual_id, :]</b></font>. A series is returned.
4. Access column data: <font color="red"><b>df.iloc[:, numerical_col_id]</b></font> or <font color="red"><b>df.loc[:, col_name]</b></font>. A series is returned.
5. Access dataframe cell: <font color="red"><b>df.iloc[numerical_row_id, numerical_col_id]</b></font> or <font color="red"><b>df.loc[actual_id, col_name]</b></font>.
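
The access patterns above can be sketched on a small dataframe built from two series. All names and values here (`rec1`, `rec2`, `shop`) are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Two hypothetical records, each stored as a series; the series labels become column names.
rec1 = pd.Series({'Name': 'Ann', 'Item': 'Pen', 'Cost': 1.5})
rec2 = pd.Series({'Name': 'Bob', 'Item': 'Ink', 'Cost': 4.0})
shop = pd.DataFrame([rec1, rec2], index=['Shop A', 'Shop B'])

row = shop.loc['Shop B', :]        # one row, returned as a series
col = shop.iloc[:, 2]              # third column ('Cost'), returned as a series
cell = shop.loc['Shop A', 'Item']  # a single value

print(list(shop.columns))
print(row['Name'], col.tolist(), cell)
```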

#### Question 7
Initialize a dataframe to store the following transaction record of Store 1 and Store 2.<br/>
`[{'Name': 'Chris', 'Item Purchased': 'Dog Food', 'Cost': 22.50},`<br/>
 `{'Name': 'Kevyn', 'Item Purchased': 'Kitty Litter', 'Cost': 2.50},`<br/>
 `{'Name': 'Vinod', 'Item Purchased': 'Bird Seed', 'Cost': 5.00}]`<br/>
 The indices of the three transactions are `['Store 1', 'Store 1', 'Store 2']` respectively.
  


In [31]:
df = pd.DataFrame([{'Name': 'Chris', 'Item Purchased': 'Dog Food', 'Cost': 22.50},
{'Name': 'Kevyn', 'Item Purchased': 'Kitty Litter', 'Cost': 2.50},
{'Name': 'Vinod', 'Item Purchased': 'Bird Seed', 'Cost': 5.00}],index =['Store 1', 'Store 1', 'Store 2'] )
df

| | Cost | Item Purchased | Name |
|---|---|---|---|
| Store 1 | 22.5 | Dog Food | Chris |
| Store 1 | 2.5 | Kitty Litter | Kevyn |
| Store 2 | 5.0 | Bird Seed | Vinod |


#### Question 8
Extract the transaction record of Store 1 from the dataframe.

In [37]:
df.loc['Store 1',:]

| | Cost | Item Purchased | Name |
|---|---|---|---|
| Store 1 | 22.5 | Dog Food | Chris |
| Store 1 | 2.5 | Kitty Litter | Kevyn |


#### Question 9
Extract the data of the 2nd column from the dataframe.

In [39]:
df.iloc[:,1]

Store 1        Dog Food
Store 1    Kitty Litter
Store 2       Bird Seed
Name: Item Purchased, dtype: object

#### Question 10
Extract the name of the customer(s) of Store 2.

In [40]:
df.loc['Store 2', 'Name']

'Vinod'

## 4.2 Miscellaneous dataframe manipulations
In this section, we walk through the following techniques of dataframe manipulation with a case study on the Olympics dataset.<br/>
1. Load data from a csv file: <font color="red"><b>df = pd.read_csv("data.csv", parameters)</b></font>.
2. Display the top few entries of the dataframe: <font color="red"><b>df.head(n)</b></font>. The default n is 5.
3. Extract the column names of the dataframe: <font color="red"><b>df.columns</b></font>.
4. Rename <b>all</b> dataframe columns: <font color="red"><b>df.columns = [list_of_col_names]</b></font>. The number of column names should match the actual number of columns of the dataframe.
5. Rename <b>specific</b> dataframe columns: <font color="red"><b>df.rename(columns={"oldName1": "newName1", "oldName2": "newName2"}, inplace = True)</b></font>.
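
The commands above can be tried without any file on disk by reading CSV text from memory. This is only a sketch: the CSV content, the dataframe name `cities`, and the new column names are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV content held in memory so the sketch needs no file on disk.
csv_text = "city,pop,area\nOslo,700,454\nLima,9700,2672\nKyiv,2900,839\n"
cities = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(cities.head(2))          # first two rows
print(list(cities.columns))    # column names; the index column is not included

# Rename one column by name; columns that are not listed keep their names.
cities.rename(columns={'pop': 'population_k'}, inplace=True)
after_rename = list(cities.columns)

# Rename all columns at once; the list length must equal the number of columns.
cities.columns = ['Population (thousands)', 'Area (km2)']
print(list(cities.columns))
```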

<font color="blue"><b>Load dataset from the "olympics.csv" as a dataframe.</b> Set the first column as the index column of the dataframe, and skip the first row.</font>

In [6]:
import pandas as pd
df = pd.read_csv('olympics.csv', index_col = 0, skiprows=1)
df.head()

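| | № Summer | 01 ! | 02 ! | 03 ! | Total | № Winter | 01 !.1 | 02 !.1 | 03 !.1 | Total.1 | № Games | 01 !.2 | 02 !.2 | 03 !.2 | Combined total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Afghanistan (AFG) | 13 | 0 | 0 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 13 | 0 | 0 | 2 | 2 |
| Algeria (ALG) | 12 | 5 | 2 | 8 | 15 | 3 | 0 | 0 | 0 | 0 | 15 | 5 | 2 | 8 | 15 |
| Argentina (ARG) | 23 | 18 | 24 | 28 | 70 | 18 | 0 | 0 | 0 | 0 | 41 | 18 | 24 | 28 | 70 |
| Armenia (ARM) | 5 | 1 | 2 | 9 | 12 | 6 | 0 | 0 | 0 | 0 | 11 | 1 | 2 | 9 | 12 |
| Australasia (ANZ) [ANZ] | 2 | 3 | 4 | 5 | 12 | 0 | 0 | 0 | 0 | 0 | 2 | 3 | 4 | 5 | 12 |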


<font color="blue">Extract names of the dataframe columns.</font>

In [7]:
df.columns

Index(['№ Summer', '01 !', '02 !', '03 !', 'Total', '№ Winter', '01 !.1',
       '02 !.1', '03 !.1', 'Total.1', '№ Games', '01 !.2', '02 !.2', '03 !.2',
       'Combined total'],
      dtype='object')

<font color='blue'><b>Rename dataframe columns with types of medals.</b></font>

It can be observed that the columns with numerical names are not named properly! <b>"01", "02", and "03" should be mapped to gold, silver, and bronze medals respectively.</b> The columns of the dataframe are renamed with the following loop.<br/>

Explanation: <b>col[:2]</b> extracts the first two characters of a column name. <b>col[4:]</b> keeps everything after the fourth character, i.e. the suffix that identifies whether the medals are counted for the Summer Olympics (no suffix), the Winter Olympics ('.1'), or both combined ('.2'). Last but not least, '№' is replaced by '#'.

In [8]:
for col in df.columns:
    if col[:2]=='01':
        df.rename(columns={col:'Gold' + col[4:]}, inplace=True)
    if col[:2]=='02':
        df.rename(columns={col:'Silver' + col[4:]}, inplace=True)
    if col[:2]=='03':
        df.rename(columns={col:'Bronze' + col[4:]}, inplace=True)
    if col[:1]=='№':
        df.rename(columns={col:'#' + col[1:]}, inplace=True) 

df.head()

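| | # Summer | Gold | Silver | Bronze | Total | # Winter | Gold.1 | Silver.1 | Bronze.1 | Total.1 | # Games | Gold.2 | Silver.2 | Bronze.2 | Combined total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Afghanistan (AFG) | 13 | 0 | 0 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 13 | 0 | 0 | 2 | 2 |
| Algeria (ALG) | 12 | 5 | 2 | 8 | 15 | 3 | 0 | 0 | 0 | 0 | 15 | 5 | 2 | 8 | 15 |
| Argentina (ARG) | 23 | 18 | 24 | 28 | 70 | 18 | 0 | 0 | 0 | 0 | 41 | 18 | 24 | 28 | 70 |
| Armenia (ARM) | 5 | 1 | 2 | 9 | 12 | 6 | 0 | 0 | 0 | 0 | 11 | 1 | 2 | 9 | 12 |
| Australasia (ANZ) [ANZ] | 2 | 3 | 4 | 5 | 12 | 0 | 0 | 0 | 0 | 0 | 2 | 3 | 4 | 5 | 12 |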
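
The string slicing used in the renaming loop above can be checked on a few sample column names without loading the dataset. The second half of this sketch builds the same mapping as one dictionary, which could be passed to a single `df.rename(columns=mapping)` call; that is an alternative formulation, not the method used in this notebook, and the names `samples`, `medals`, and `mapping` are illustrative.

```python
# Sample column names copied from the df.columns output.
samples = ['01 !', '02 !.1', '03 !.2', '№ Summer']

for name in samples:
    prefix, suffix = name[:2], name[4:]
    print(repr(name), '->', repr(prefix), repr(suffix))

# Build the old-name -> new-name mapping in one pass.
medals = {'01': 'Gold', '02': 'Silver', '03': 'Bronze'}
mapping = {}
for name in samples:
    if name[:2] in medals:
        mapping[name] = medals[name[:2]] + name[4:]
    elif name[:1] == '№':
        mapping[name] = '#' + name[1:]
print(mapping)
```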


## 4.3 Extract dataframe rows and columns
1. Extract dataframe rows based on a condition: <font color="red"><b>df2 = df[condition expression]</b></font>. For instance, <font color="red"><b>df2 = df[df['column'] > 0]</b></font> extracts the rows whose value in 'column' is positive.
2. Extract specific columns of a dataframe and form a new dataframe: <font color="red"><b>df2 = df[[list_of_columns]]</b></font>.
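
A minimal sketch of both kinds of extraction, assuming a made-up dataframe `medal_df` with hypothetical column names:

```python
import pandas as pd

# Hypothetical medal counts for illustration.
medal_df = pd.DataFrame(
    {'summer_gold': [3, 0, 5], 'winter_gold': [1, 2, 0], 'games': [10, 4, 8]},
    index=['A', 'B', 'C'],
)

# Each comparison yields a boolean series; & combines them element-wise.
# The parentheses are required because & binds more tightly than > in Python.
both = medal_df[(medal_df['summer_gold'] > 0) & (medal_df['winter_gold'] > 0)]

# A list of column names selects those columns and returns a dataframe.
golds = medal_df[['summer_gold', 'winter_gold']]

print(list(both.index))
print(golds.shape)
```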

#### Question 11
Extract data of countries with gold medals in <b>both summer and winter</b> Olympic Games.

In [53]:
df2 = df[(df['Gold'] >0) & (df['Gold.1']>0)]
df2

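| | # Summer | Gold | Silver | Bronze | Total | # Winter | Gold.1 | Silver.1 | Bronze.1 | Total.1 | # Games | Gold.2 | Silver.2 | Bronze.2 | Combined total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Australia (AUS) [AUS] [Z] | 25 | 139 | 152 | 177 | 468 | 18 | 5 | 3 | 4 | 12 | 43 | 144 | 155 | 181 | 480 |
| Austria (AUT) | 26 | 18 | 33 | 35 | 86 | 22 | 59 | 78 | 81 | 218 | 48 | 77 | 111 | 116 | 304 |
| Belarus (BLR) | 5 | 12 | 24 | 39 | 75 | 6 | 6 | 4 | 5 | 15 | 11 | 18 | 28 | 44 | 90 |
| Belgium (BEL) | 25 | 37 | 52 | 53 | 142 | 20 | 1 | 1 | 3 | 5 | 45 | 38 | 53 | 56 | 147 |
| Bulgaria (BUL) [H] | 19 | 51 | 85 | 78 | 214 | 19 | 1 | 2 | 3 | 6 | 38 | 52 | 87 | 81 | 220 |
| Canada (CAN) | 25 | 59 | 99 | 121 | 279 | 22 | 62 | 56 | 52 | 170 | 47 | 121 | 155 | 173 | 449 |
| China (CHN) [CHN] | 9 | 201 | 146 | 126 | 473 | 10 | 12 | 22 | 19 | 53 | 19 | 213 | 168 | 145 | 526 |
| Croatia (CRO) | 6 | 6 | 7 | 10 | 23 | 7 | 4 | 6 | 1 | 11 | 13 | 10 | 13 | 11 | 34 |
| Czech Republic (CZE) [CZE] | 5 | 14 | 15 | 15 | 44 | 6 | 7 | 9 | 8 | 24 | 11 | 21 | 24 | 23 | 68 |
| Czechoslovakia (TCH) [TCH] | 16 | 49 | 49 | 45 | 143 | 16 | 2 | 8 | 15 | 25 | 32 | 51 | 57 | 60 | 168 |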


#### Question 12
Extract the 3 columns that store the number of gold, silver, and bronze medals won in the summer Olympic Games from the Question 11 dataframe.

In [64]:
df2[['Gold', 'Silver', 'Bronze']]

| | Gold | Silver | Bronze |
|---|---|---|---|
| Australia (AUS) [AUS] [Z] | 139 | 152 | 177 |
| Austria (AUT) | 18 | 33 | 35 |
| Belarus (BLR) | 12 | 24 | 39 |
| Belgium (BEL) | 37 | 52 | 53 |
| Bulgaria (BUL) [H] | 51 | 85 | 78 |
| Canada (CAN) | 59 | 99 | 121 |
| China (CHN) [CHN] | 201 | 146 | 126 |
| Croatia (CRO) | 6 | 7 | 10 |
| Czech Republic (CZE) [CZE] | 14 | 15 | 15 |
| Czechoslovakia (TCH) [TCH] | 49 | 49 | 45 |


## 4.4 Index dataframes
In Section 4.1 we touched on the importance of indices for accessing data in a dataframe. The index column of a dataframe often needs to be adjusted for specific use cases.
1. Restore current index column: <font color="red"><b>df['column'] = df.index</b></font>. This copies the index into a regular column. If this is not done, the current index is discarded when a new index column is set!
2. Set new index column: <font color="red"><b>df = df.set_index([list of index columns])</b></font>. Note that a dataframe can have multiple columns set as index columns.
3. Reset index column of a dataframe: <font color="red"><b>df.reset_index()</b></font>. The existing index is restored as a regular column, named after the index, or "index" if the index has no name. A default numerical index is created. The call returns a new dataframe, so the result must be assigned to be kept.
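
The three steps can be sketched on a tiny dataframe. The names `teams`, `by_wins`, and `restored` are hypothetical and used only for illustration.

```python
import pandas as pd

# Hypothetical dataframe with an unnamed index.
teams = pd.DataFrame({'wins': [3, 1]}, index=['Red', 'Blue'])

teams['team'] = teams.index          # keep the old index as a regular column
by_wins = teams.set_index('wins')    # 'wins' leaves the columns and becomes the index

# reset_index moves the index back into a column named after the index ('wins' here);
# an unnamed index produces a column called 'index' instead.
restored = by_wins.reset_index()

print(list(by_wins.columns), by_wins.index.name)
print(list(restored.columns))
print(list(teams.reset_index().columns))
```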

#### Question 13
Take the original dataset from 4.2 (<b>df</b>). Restore the current index column with the name <b>"Participant"</b>. Set <b>"Gold"</b> as the new index column.

In [11]:
df['Participant'] = df.index
df = df.set_index('Gold')
df.head()

| Gold | # Summer | Silver | Bronze | Total | # Winter | Gold.1 | Silver.1 | Bronze.1 | Total.1 | # Games | Gold.2 | Silver.2 | Bronze.2 | Combined total | Participant |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13 | 0 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 13 | 0 | 0 | 2 | 2 | 0 |
| 5 | 12 | 2 | 8 | 15 | 3 | 0 | 0 | 0 | 0 | 15 | 5 | 2 | 8 | 15 | 5 |
| 18 | 23 | 24 | 28 | 70 | 18 | 0 | 0 | 0 | 0 | 41 | 18 | 24 | 28 | 70 | 18 |
| 1 | 5 | 2 | 9 | 12 | 6 | 0 | 0 | 0 | 0 | 11 | 1 | 2 | 9 | 12 | 1 |
| 3 | 2 | 4 | 5 | 12 | 0 | 0 | 0 | 0 | 0 | 2 | 3 | 4 | 5 | 12 | 3 |


## 4.5 Expand dataframes
Dataframes can be expanded by inserting new columns or by merging two dataframes on one or more columns. The merge operation is very similar to a table join in a database. Four types of merge, <b>inner, outer, left, right</b>, are provided to handle different use cases; they are not explained in detail here.<br/>
1. Insert new column to a dataframe: <font color="red"><b>df['new_column'] = [list_of_data]</b></font>.
2. Insert new column with a common value to a dataframe: <font color="red"><b>df['new_column'] = common_data</b></font>.
3. Merge dataframes on index column: <font color="red"><b>pd.merge(df1, df2, how = 'merge_type', left_index = True, right_index = True)</b></font>.
4. Merge dataframes on normal column: <font color="red"><b>pd.merge(df1, df2, how = 'merge_type', left_on = 'column1', right_on = 'column2')</b></font>.
5. Merge dataframes on multiple columns: <font color="red"><b>pd.merge(df1, df2, how = 'merge_type', left_on = [list of df1 columns], right_on = [list of df2 columns])</b></font>. The sequences of the columns in the two lists should match with each other.
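
As a rough illustration of how the four merge types differ, the sketch below merges two made-up tables on a shared 'Name' column and records which names survive each merge. The dataframes `left` and `right` and the dictionary `names_by_type` are hypothetical.

```python
import pandas as pd

# Hypothetical tables sharing a 'Name' key.
left = pd.DataFrame({'Name': ['Ann', 'Bob', 'Cy'], 'Role': ['Dean', 'Tutor', 'Clerk']})
right = pd.DataFrame({'Name': ['Bob', 'Cy', 'Dee'], 'School': ['Law', 'Arts', 'Music']})

names_by_type = {}
for how in ['inner', 'outer', 'left', 'right']:
    merged = pd.merge(left, right, how=how, left_on='Name', right_on='Name')
    names_by_type[how] = sorted(merged['Name'])
    print(how, names_by_type[how])
```

An inner merge keeps only the names found in both tables, an outer merge keeps all names, and left and right merges keep every name from the first and second table respectively. Cells with no match are filled with NaN.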

A dataframe is initialized from a list that stores the transaction records of two stores.

In [12]:
store_df = pd.DataFrame([{'Name': 'Chris', 'Item Purchased': 'Sponge', 'Cost': 22.50},
                         {'Name': 'Kevyn', 'Item Purchased': 'Kitty Litter', 'Cost': 2.50},
                         {'Name': 'Filip', 'Item Purchased': 'Spoon', 'Cost': 5.00}],
                        index=['Store 1', 'Store 1', 'Store 2'])
store_df

| | Cost | Item Purchased | Name |
|---|---|---|---|
| Store 1 | 22.5 | Sponge | Chris |
| Store 1 | 2.5 | Kitty Litter | Kevyn |
| Store 2 | 5.0 | Spoon | Filip |


#### Question 14
Insert two columns into the <b>"store_df"</b> dataframe to store the dates and delivery statuses of the transactions. <i>The three transactions are processed on December 1, January 1, and mid-May. All the goods have been delivered.</i>

In [16]:
store_df['Date']=['December 1', 'January 1', 'May 15']
store_df['Delivery']='Done'
store_df

| | Cost | Item Purchased | Name | Date | Delivery |
|---|---|---|---|---|---|
| Store 1 | 22.5 | Sponge | Chris | December 1 | Done |
| Store 1 | 2.5 | Kitty Litter | Kevyn | January 1 | Done |
| Store 2 | 5.0 | Spoon | Filip | May 15 | Done |


Two dataframes are initialized to store the information of students and staff.

In [32]:
staff_df = pd.DataFrame([{'Name': 'Kelly', 'Role': 'Director of HR'},
                         {'Name': 'Sally', 'Role': 'Course liasion'},
                         {'Name': 'James', 'Role': 'Grader'}])
staff_df = staff_df.set_index('Name')
student_df = pd.DataFrame([{'Name': 'James', 'School': 'Business'},
                           {'Name': 'Mike', 'School': 'Law'},
                           {'Name': 'Sally', 'School': 'Engineering'}])
student_df = student_df.set_index('Name')
print(staff_df)
print()
print(student_df)

                 Role
Name                 
Kelly  Director of HR
Sally  Course liasion
James          Grader

            School
Name              
James     Business
Mike           Law
Sally  Engineering


#### Question 15
Perform an inner merge on the student and staff dataframes.

In [19]:
school_role = pd.merge(staff_df, student_df, how='inner', left_index=True, right_index=True)
school_role

| Name | Role | School |
|---|---|---|
| Sally | Course liasion | Engineering |
| James | Grader | Business |


#### Question 16
Reset the index columns of the student and staff dataframe. Perform a left merge on the 'Name' column.

In [35]:
student_df = student_df.reset_index()
staff_df= staff_df.reset_index()
pd.merge(staff_df, student_df, how='left', left_on='Name', right_on='Name')

| | Name | Role | School |
|---|---|---|---|
| 0 | Kelly | Director of HR | NaN |
| 1 | Sally | Course liasion | Engineering |
| 2 | James | Grader | Business |


In [39]:
stu_sta = pd.merge(student_df, staff_df, how='left', left_on='Name', right_on='Name')
stu_sta

| | Name | School | Role |
|---|---|---|---|
| 0 | James | Business | Grader |
| 1 | Mike | Law | NaN |
| 2 | Sally | Engineering | Course liasion |


## 4.6 Group dataframes

By "group by" we are referring to a process involving the following steps.
<ul>
    <li><b>Splitting</b> the data into groups based on some criteria</li>
    <li><b>Applying</b> a function to each group independently</li>
    <li><b>Combining</b> the results into a data structure</li>
</ul>
Groupby is an important function used to reshape dataframes based on one or more key columns. Its usage can be very flexible. Here we introduce the most common groupby operations in pandas with a case study on the demographic census of the United States.

1. Create a group variable that groups columnA by columnB: <b>groupby_var = df['columnA'].groupby(df['columnB'])</b>. This grouped variable is now a GroupBy object. It has not actually computed anything yet except for some intermediate data about the group key df['columnB']. The idea is that this object has all of the information needed to then apply some operation to each of the groups.
2. Calculate statistics of each individual group: <font color="red"><b>df['columnA'].groupby(df['columnB']).mean()</b></font>. Mean is calculated here as an example.
3. Generate descriptive statistics by group: <font color="red"><b>df['columnA'].groupby(df['columnB']).describe()</b></font>. The summary statistics of each group are displayed in table format.
4. Aggregate the data in a column with specific functions: <font color="red"><b>df.groupby('columnA').agg({'columnB': [list of functions], 'columnC': [list of functions]})</b></font>.
5. Cut a dataframe column into multiple bins, and group the data by these bins: <font color="red"><b>bins = pd.cut(df['column'], no_of_bins)</b></font>, followed by <font color="red"><b>df.groupby(['index column', bins]).size()</b></font>. Please refer to the last example for practice.
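
The groupby patterns listed above can be sketched on a small made-up dataframe before applying them to the census data. The dataframe `pop_df`, its columns, and the choice of 3 bins are assumptions for illustration; `observed=True` is passed so that only the (state, bin) pairs that actually occur are counted.

```python
import pandas as pd

# Hypothetical county-level data for illustration.
pop_df = pd.DataFrame({
    'state': ['X', 'X', 'Y', 'Y', 'Y'],
    'pop': [10, 30, 20, 35, 100],
})

grouped = pop_df['pop'].groupby(pop_df['state'])   # a GroupBy object; nothing is aggregated yet
means = grouped.mean()                             # one value per state

# Several statistics at once; string names refer to built-in aggregations.
summary = pop_df.groupby('state').agg({'pop': ['count', 'sum', 'mean']})

# Cut 'pop' into 3 equal-width bins, then count rows per (state, bin) pair.
bins = pd.cut(pop_df['pop'], 3)
counts = pop_df.groupby(['state', bins], observed=True).size()

print(means)
print(summary)
print(counts)
```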

The following dataframe of demographic census is loaded.

In [40]:
census_df = pd.read_csv('census.csv')
census_df = census_df[census_df['SUMLEV']==50] # The entries on the state level are ignored.
census_df

| | SUMLEV | REGION | DIVISION | STATE | COUNTY | STNAME | CTYNAME | CENSUS2010POP | ESTIMATESBASE2010 | POPESTIMATE2010 | ... | RDOMESTICMIG2011 | RDOMESTICMIG2012 | RDOMESTICMIG2013 | RDOMESTICMIG2014 | RDOMESTICMIG2015 | RNETMIG2011 | RNETMIG2012 | RNETMIG2013 | RNETMIG2014 | RNETMIG2015 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 50 | 3 | 6 | 1 | 1 | Alabama | Autauga County | 54571 | 54571 | 54660 | ... | 7.242091 | -2.915927 | -3.012349 | 2.265971 | -2.530799 | 7.606016 | -2.626146 | -2.722002 | 2.592270 | -2.187333 |
| 2 | 50 | 3 | 6 | 1 | 3 | Alabama | Baldwin County | 182265 | 182265 | 183193 | ... | 14.832960 | 17.647293 | 21.845705 | 19.243287 | 17.197872 | 15.844176 | 18.559627 | 22.727626 | 20.317142 | 18.293499 |
| 3 | 50 | 3 | 6 | 1 | 5 | Alabama | Barbour County | 27457 | 27457 | 27341 | ... | -4.728132 | -2.500690 | -7.056824 | -3.904217 | -10.543299 | -4.874741 | -2.758113 | -7.167664 | -3.978583 | -10.543299 |
| 4 | 50 | 3 | 6 | 1 | 7 | Alabama | Bibb County | 22915 | 22919 | 22861 | ... | -5.527043 | -5.068871 | -6.201001 | -0.177537 | 0.177258 | -5.088389 | -4.363636 | -5.403729 | 0.754533 | 1.107861 |
| 5 | 50 | 3 | 6 | 1 | 9 | Alabama | Blount County | 57322 | 57322 | 57373 | ... | 1.807375 | -1.177622 | -1.748766 | -2.062535 | -1.369970 | 1.859511 | -0.848580 | -1.402476 | -1.577232 | -0.884411 |
| 6 | 50 | 3 | 6 | 1 | 11 | Alabama | Bullock County | 10914 | 10915 | 10887 | ... | -30.953709 | -5.180127 | -1.130263 | 14.354290 | -16.167247 | -29.001673 | -2.825524 | 1.507017 | 17.243790 | -13.193961 |
| 7 | 50 | 3 | 6 | 1 | 13 | Alabama | Butler County | 20947 | 20946 | 20944 | ... | -14.032727 | -11.684234 | -5.655413 | 1.085428 | -6.529805 | -13.936612 | -11.586865 | -5.557058 | 1.184103 | -6.430868 |
| 8 | 50 | 3 | 6 | 1 | 15 | Alabama | Calhoun County | 118572 | 118586 | 118437 | ... | -6.155670 | -4.611706 | -5.524649 | -4.463211 | -3.376322 | -5.791579 | -4.092677 | -5.062836 | -3.912834 | -2.806406 |
| 9 | 50 | 3 | 6 | 1 | 17 | Alabama | Chambers County | 34215 | 34170 | 34098 | ... | -2.731639 | 3.849092 | 2.872721 | -2.287222 | 1.349468 | -1.821092 | 4.701181 | 3.781439 | -1.290228 | 2.346901 |
| 10 | 50 | 3 | 6 | 1 | 19 | Alabama | Cherokee County | 25989 | 25986 | 25976 | ... | 6.339327 | 1.113180 | 5.488706 | -0.076806 | -3.239866 | 6.416167 | 1.420264 | 5.757384 | 0.230419 | -2.931307 |


#### Question 17
Calculate the average population of different states in 2010 using the <b>groupby</b> function.

In [44]:
mean_2010 = census_df['CENSUS2010POP'].groupby(census_df['STNAME']).mean()
mean_2010.head()

STNAME
Alabama        71339.343284
Alaska         24490.724138
Arizona       426134.466667
Arkansas       38878.906667
California    642309.586207
Name: CENSUS2010POP, dtype: float64

#### Question 18
By aggregation, get the number of counties, the sum, and the mean of the actual 2010 population in each state. The data from the 2010 Census should be used.

In [51]:
import numpy as np
# Recent pandas versions prefer string names such as 'sum' and 'mean' over NumPy functions here.
mean_2010 = census_df.groupby('STNAME').agg({'CENSUS2010POP': [np.size, np.sum, np.average]})
mean_2010.head()

| STNAME | CENSUS2010POP size | CENSUS2010POP sum | CENSUS2010POP average |
|---|---|---|---|
| Alabama | 67 | 4779736 | 71339.343284 |
| Alaska | 29 | 710231 | 24490.724138 |
| Arizona | 15 | 6392017 | 426134.466667 |
| Arkansas | 75 | 2915918 | 38878.906667 |
| California | 58 | 37253956 | 642309.586207 |


#### Question 19
Cut the actual population of 2010 into 5 bins. Group the dataframe by state and the population bins, and list the size of each population bin.

In [None]:
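# Cut the 2010 census population into 5 equal-width bins.
bins = pd.cut(census_df['CENSUS2010POP'], 5)
# Count the counties in each (state, population bin) pair.
census_df.groupby(['STNAME', bins]).size()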