In [48]:
import pandas as pd # A general purpose Python library for data analysis
import numpy as np # A library for scientific computing in Python (e.g., provides high-performance multi-dimensional array objects and operations)

import matplotlib.pyplot as plt # a plotting library for Python and NumPy (readily customizable)
import seaborn as sns # Another plotting library for Python (less verbose syntax, excellent default themes; it uses matplotlib behind the scenes)
import time

## Knowledge Stream Summer 2023

In this notebook, we will learn about the key data structures provided by the Pandas library: **Data Frames, Series, and Indices**.

In addition, we will learn about the following operations:
* How to access data contained in these structures
* How to read files (e.g., csv, xlsx, sql) to create these structures
* How to carry out different data manipulation tasks using these structures

`Dataset`: US elections with information about candidates, their party, votes won, year of election and the result.

## Reading in Data Frames from Files

Pandas has a number of useful file reading tools. You can see them enumerated by typing **"pd.re"** and pressing `tab`. We'll be using **read_csv** today. Note that these file reading functions do all the *data parsing* for you, which is very useful.

Before working with the data, let's first load the **elections.csv** file into a dataframe and take a look at it.

In [47]:
#Answer Here
import pandas as pd
elections=pd.read_csv("elections.csv")
elections

Unnamed: 0,Candidate,Party,%,Year,Result
0,Reagan,Republican,50.7,1980,win
1,Carter,Democratic,41.0,1980,loss
2,Anderson,Independent,6.6,1980,loss
3,Reagan,Republican,58.8,1984,win
4,Mondale,Democratic,37.6,1984,loss
5,Bush,Republican,53.4,1988,win
6,Dukakis,Democratic,45.6,1988,loss
7,Clinton,Democratic,43.0,1992,win
8,Bush,Republican,37.4,1992,loss
9,Perot,Independent,18.9,1992,loss


We can use the **head command** to show only a few rows of a dataframe.

In [11]:
# Answer Here
elections.head()

Unnamed: 0,Candidate,Party,%,Year,Result
0,Reagan,Republican,50.7,1980,win
1,Carter,Democratic,41.0,1980,loss
2,Anderson,Independent,6.6,1980,loss
3,Reagan,Republican,58.8,1984,win
4,Mondale,Democratic,37.6,1984,loss


There is also a **tail command**.

In [12]:
#Answer Here
elections.tail()

Unnamed: 0,Candidate,Party,%,Year,Result
18,McCain,Republican,45.7,2008,loss
19,Obama,Democratic,51.1,2012,win
20,Romney,Republican,47.2,2012,loss
21,Clinton,Democratic,48.2,2016,loss
22,Trump,Republican,46.1,2016,win


The `read_csv` command lets us specify a **column to use as the index**. For example, we could have used __Year__ as the index.

In [13]:
#Answer Here
pd.read_csv("elections.csv", index_col="Year")

Year,Candidate,Party,%,Result
1980,Reagan,Republican,50.7,win
1980,Carter,Democratic,41.0,loss
1980,Anderson,Independent,6.6,loss
1984,Reagan,Republican,58.8,win
1984,Mondale,Democratic,37.6,loss
1988,Bush,Republican,53.4,win
1988,Dukakis,Democratic,45.6,loss
1992,Clinton,Democratic,43.0,win
1992,Bush,Republican,37.4,loss
1992,Perot,Independent,18.9,loss


In [97]:
pd.read_csv("elections.csv",index_col="Party")

Party,Candidate,%,Year,Result
Republican,Reagan,50.7,1980,win
Democratic,Carter,41.0,1980,loss
Independent,Anderson,6.6,1980,loss
Republican,Reagan,58.8,1984,win
Democratic,Mondale,37.6,1984,loss
Republican,Bush,53.4,1988,win
Democratic,Dukakis,45.6,1988,loss
Democratic,Clinton,43.0,1992,win
Republican,Bush,37.4,1992,loss
Independent,Perot,18.9,1992,loss


Alternatively, we could have used the **set_index** command on the dataframe.

In [14]:
#Answer Here
df=pd.read_csv("elections.csv")
df.set_index("Candidate")

Candidate,Party,%,Year,Result
Reagan,Republican,50.7,1980,win
Carter,Democratic,41.0,1980,loss
Anderson,Independent,6.6,1980,loss
Reagan,Republican,58.8,1984,win
Mondale,Democratic,37.6,1984,loss
Bush,Republican,53.4,1988,win
Dukakis,Democratic,45.6,1988,loss
Clinton,Democratic,43.0,1992,win
Bush,Republican,37.4,1992,loss
Perot,Independent,18.9,1992,loss


# Caution:
The **set_index command** (like most other data frame methods) **does not modify the dataframe** it is called on; it returns a new dataframe, so the original `df` above is untouched. Note: There is a flag called `inplace` which does modify the calling dataframe (e.g., `elections.set_index("Party", inplace=True)`).
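A minimal sketch of this behavior, using a small hand-built sample (the `sample` and `by_party` names are illustrative, not part of the notebook):

```python
import pandas as pd

# A few rows of the elections data shown above
sample = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson"],
    "Party": ["Republican", "Democratic", "Independent"],
    "Year": [1980, 1980, 1980],
})

# set_index returns a new DataFrame; `sample` itself is left as it was
by_party = sample.set_index("Party")

print(list(sample.columns))   # 'Party' is still an ordinary column here
print(by_party.index.name)    # the returned frame is indexed by 'Party'
```

To keep the re-indexed version, assign the result to a name, as done with `by_party` here.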

## Duplicate Columns?
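By contrast with index labels, which may repeat (as with `Year` above), column names are expected to be unique. For example, if we try to read in a file for which column names are not unique, Pandas will automatically rename any duplicates (below, the second `name` column becomes `name.1`). Load duplicate_columns.csv.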
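The renaming can be reproduced without a file on disk. This sketch assumes a tiny in-memory CSV (`csv_text` is made up for illustration) whose header repeats the name `name`:

```python
import io
import pandas as pd

# In-memory CSV with a duplicated column name in the header
csv_text = "name,name,flavor\njohn,smith,vanilla\nzhang,shan,chocolate\n"

dup = pd.read_csv(io.StringIO(csv_text))

# read_csv keeps the first 'name' and renames the second to 'name.1'
print(list(dup.columns))
```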

In [15]:
#Answer Here
dp=pd.read_csv("duplicate_columns.csv")
dp

Unnamed: 0,name,name.1,flavor
0,john,smith,vanilla
1,zhang,shan,chocolate
2,fulan,alfulani,strawberry
3,hong,gildong,banana


## The [ ] Operator & Indexing

The DataFrame class has an indexing operator **[ ]** (also known as the bracket operator) that lets you do a variety of different things. If you provide a string to the **[ ]** operator, you get back a ***Series*** corresponding to the requested label.

1. Use **[ ]** to display different columns.

2. Use a list to retrieve multiple columns.

In [16]:
# Answer Here
sp=df['Year']
sp

0     1980
1     1980
2     1980
3     1984
4     1984
5     1988
6     1988
7     1992
8     1992
9     1992
10    1996
11    1996
12    1996
13    2000
14    2000
15    2004
16    2004
17    2008
18    2008
19    2012
20    2012
21    2016
22    2016
Name: Year, dtype: int64

The **[ ]** operator also accepts a list of strings. In this case, you get back a **DataFrame** corresponding to the requested strings.

In [None]:
# Answer Here


A list of one label also returns a DataFrame. This can be handy if you want your results as a DataFrame, not a series.
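As a quick sketch of the difference between passing a string and passing a list (the `sample` frame is a hand-built stand-in for the elections data):

```python
import pandas as pd

sample = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson"],
    "Party": ["Republican", "Democratic", "Independent"],
    "Year": [1980, 1980, 1980],
})

as_series = sample["Candidate"]            # a string gives a Series
one_col = sample[["Candidate"]]            # a one-element list gives a DataFrame
two_cols = sample[["Candidate", "Party"]]  # a longer list gives a DataFrame too

print(type(as_series).__name__)
print(type(one_col).__name__)
print(type(two_cols).__name__)
```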

Note that we can also use the **to_frame** method to turn a Series into a DataFrame.

Extract a single column (e.g., "Candidate") from the DataFrame; the result is a Series. Then convert that Series into a DataFrame.

In [17]:
Year_df=sp.to_frame()
Year_df

Unnamed: 0,Year
0,1980
1,1980
2,1980
3,1984
4,1984
5,1988
6,1988
7,1992
8,1992
9,1992


In [None]:
# Answer Here

### Row Indexing

The `[]` operator also accepts numerical slices as arguments. In this case, we are indexing by row, not column!

Extract a few rows from the DataFrame.

In [84]:
# Answer Here
dp[1:2]

Unnamed: 0,name,name.1,flavor
1,zhang,shan,chocolate


If you provide a single argument to the `[]` operator, it tries to use it as a name. This is true even if the argument passed to **[ ]** is an integer.

In [104]:
df[0]  # this does not work: 0 is looked up as a column label, and no column is named 0


KeyError: 0
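The contrast between a single integer and a slice can be sketched as follows (again on a small illustrative `sample` frame; `raised` is just a flag for the demonstration):

```python
import pandas as pd

sample = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson"],
    "Party": ["Republican", "Democratic", "Independent"],
})

raised = False
try:
    sample[0]            # a single value is treated as a column label
except KeyError as err:
    raised = True
    print("KeyError:", err)

first_row = sample[0:1]  # a slice is treated as a range of row positions
print(first_row)
```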

The following cells allow you to **test your understanding**. Let's go over the summary of what we have learnt (see slides).

# Creating DataFrames
Create a DataFrame from a list of rows and a list of column names.

In [1]:
# Answer Here
import pandas as pd
data = [
    ['John', 25, 'New York'],
    ['Alice', 30, 'Los Angeles'],
    ['Bob', 35, 'Chicago'],
    ['Emma', 40, 'Houston']
]


columns = ['Name', 'Age', 'City']

df = pd.DataFrame(data, columns=columns)
print(df)


    Name  Age         City
0   John   25     New York
1  Alice   30  Los Angeles
2    Bob   35      Chicago
3   Emma   40      Houston


Create a DataFrame from a **dictionary** that maps column names to lists of values.

In [2]:
# Answer Here
import pandas as pd

data = {
    'Name': ['John', 'Alice', 'Bob', 'Emma'],
    'Age': [25, 30, 35, 40],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}


df = pd.DataFrame(data)

print(df)


    Name  Age         City
0   John   25     New York
1  Alice   30  Los Angeles
2    Bob   35      Chicago
3   Emma   40      Houston


## Filtering via Boolean Array Selection

The `[]` operator also supports an array of booleans as input. In this case, the array must be exactly as long as the number of rows. The result is a **filtered version of the data frame**, where **only rows corresponding to True appear**.

In [6]:
elections[[False, False, False, False, False,
          False, False, True, False, False,
          True, False, False, False, True,
          False, False, False, False, False,
          False, True, False]]

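Unnamed: 0,Candidate,Party,%,Year,Result
7,Clinton,Democratic,43.0,1992,win
10,Clinton,Democratic,49.2,1996,win
14,Bush,Republican,47.9,2000,win
21,Clinton,Democratic,48.2,2016,loss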

One very common task in Data Science is **filtering**. Boolean Array Selection is one way to achieve this in Pandas. We start by observing that **comparison operators** like the equality operator can be applied to **Pandas Series data** to generate a **Boolean Array**.

Compare the 'Result' column to the string 'win' and show the results.

In [18]:
#Answer Here
winners_df = elections[elections['Result'] == 'win']
winners_df

Unnamed: 0,Candidate,Party,%,Year,Result
0,Reagan,Republican,50.7,1980,win
3,Reagan,Republican,58.8,1984,win
5,Bush,Republican,53.4,1988,win
7,Clinton,Democratic,43.0,1992,win
10,Clinton,Democratic,49.2,1996,win
14,Bush,Republican,47.9,2000,win
16,Bush,Republican,50.7,2004,win
17,Obama,Democratic,52.9,2008,win
19,Obama,Democratic,51.1,2012,win
22,Trump,Republican,46.1,2016,win


Compare the 'Party' column to the string 'Democratic' and show the results.

In [20]:
#Answer Here
democratic_candidates_df = elections[elections['Party'] == 'Democratic']
democratic_candidates_df

Unnamed: 0,Candidate,Party,%,Year,Result
1,Carter,Democratic,41.0,1980,loss
4,Mondale,Democratic,37.6,1984,loss
6,Dukakis,Democratic,45.6,1988,loss
7,Clinton,Democratic,43.0,1992,win
10,Clinton,Democratic,49.2,1996,win
13,Gore,Democratic,48.4,2000,loss
15,Kerry,Democratic,48.3,2004,loss
17,Obama,Democratic,52.9,2008,win
19,Obama,Democratic,51.1,2012,win
21,Clinton,Democratic,48.2,2016,loss


The output of the logical operator applied to the Series is **another Series with the same name and index, but of datatype boolean**.

These boolean Series can be used as an argument to the `[]` operator.
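A small sketch of both points, assuming a hand-built `sample` frame (the name `is_win` is illustrative):

```python
import pandas as pd

sample = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson"],
    "Result": ["win", "loss", "loss"],
})

# The comparison yields a boolean Series named after the column it came from
is_win = sample["Result"] == "win"
print(is_win.dtype)
print(is_win.name)

# Passing that Series to [] keeps only the rows marked True
winners = sample[is_win]
print(winners)
```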

Create a DataFrame of all election results from 1980 onward.

In [23]:
#Answer Here
elections_since_1980 = elections[elections['Year'] >= 1980]
elections_since_1980

Unnamed: 0,Candidate,Party,%,Year,Result
0,Reagan,Republican,50.7,1980,win
1,Carter,Democratic,41.0,1980,loss
2,Anderson,Independent,6.6,1980,loss
3,Reagan,Republican,58.8,1984,win
4,Mondale,Democratic,37.6,1984,loss
5,Bush,Republican,53.4,1988,win
6,Dukakis,Democratic,45.6,1988,loss
7,Clinton,Democratic,43.0,1992,win
8,Bush,Republican,37.4,1992,loss
9,Perot,Independent,18.9,1992,loss


We could first assign the result of the comparison to a new variable (e.g., `iswin`) and then pass it to `[]`, but this is uncommon. Usually, the Series is created and used on the same line, as in the cells above. Such code is a little tricky to read at first, but you'll get used to it quickly.

Show all 'win' results between 1980 and 2000.

In [24]:
#Answer Here
win_results_1980_to_2000 = elections[(elections['Result'] == 'win') & (elections['Year'].between(1980, 2000))]
win_results_1980_to_2000

Unnamed: 0,Candidate,Party,%,Year,Result
0,Reagan,Republican,50.7,1980,win
3,Reagan,Republican,58.8,1984,win
5,Bush,Republican,53.4,1988,win
7,Clinton,Democratic,43.0,1992,win
10,Clinton,Democratic,49.2,1996,win
14,Bush,Republican,47.9,2000,win


Show all 'loss' results for the Independent party.

In [27]:
# Answer Here
# Values in 'Result' are lowercase ('win'/'loss'), and string comparison is case-sensitive
loss_results_independent = elections[(elections['Result'] == 'loss') & (elections['Party'] == 'Independent')]
loss_results_independent


We can select multiple criteria by creating multiple boolean Series and combining them using the `&` operator.
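A minimal sketch of combining conditions on a few sample rows. Each comparison needs its own parentheses because `&` binds more tightly than `==` or `<`. The `|` (or) operator shown at the end is not used in the exercises, but it works the same way:

```python
import pandas as pd

sample = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Clinton", "Perot"],
    "Party": ["Republican", "Democratic", "Democratic", "Independent"],
    "%": [50.7, 41.0, 43.0, 18.9],
    "Year": [1980, 1980, 1992, 1992],
    "Result": ["win", "loss", "win", "loss"],
})

# Both conditions must hold: a win with less than 50% of the vote
narrow_wins = sample[(sample["Result"] == "win") & (sample["%"] < 50)]
print(narrow_wins)

# Either condition may hold: Independent candidates or anyone from 1980
indep_or_1980 = sample[(sample["Party"] == "Independent") | (sample["Year"] == 1980)]
print(indep_or_1980)
```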

Show winning results with a vote share below 50%.

In [30]:
# Answer Here
win_results = elections[elections['Result'] == 'win']

# Filter rows where the winning vote share (the '%' column) is less than 50
win_less_than_50_percent = win_results[win_results['%'] < 50]

# Show the filtered results
print(win_less_than_50_percent)

   Candidate       Party     %  Year Result
7    Clinton  Democratic  43.0  1992    win
10   Clinton  Democratic  49.2  1996    win
14      Bush  Republican  47.9  2000    win
22     Trump  Republican  46.1  2016    win


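## loc and iloc

Show the first 5 entries.

In [32]:
# Answer Here

# loc selects by label, and its slices include the end label
elections.loc[:4]

# iloc selects by position, and its slices exclude the end position
elections.iloc[:5]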

Unnamed: 0,Candidate,Party,%,Year,Result
0,Reagan,Republican,50.7,1980,win
1,Carter,Democratic,41.0,1980,loss
2,Anderson,Independent,6.6,1980,loss
3,Reagan,Republican,58.8,1984,win
4,Mondale,Democratic,37.6,1984,loss


You can provide `.loc` a list of row labels (e.g., 0 through 5) and a list of column labels (e.g., ['Candidate', 'Party', 'Year']) as input to return a dataframe.
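One way this could look, sketched on a small sample frame rather than the full elections data (so the row labels only run from 0 to 3):

```python
import pandas as pd

sample = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson", "Reagan"],
    "Party": ["Republican", "Democratic", "Independent", "Republican"],
    "%": [50.7, 41.0, 6.6, 58.8],
    "Year": [1980, 1980, 1980, 1984],
})

# First argument: list of row labels; second argument: list of column labels
subset = sample.loc[[0, 1, 2], ["Candidate", "Party", "Year"]]
print(subset)
```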

In [None]:
#Answer Here


Loc also supports **slicing** (for all types, including numeric and string labels!). Note that the slicing for loc is **inclusive**, even for numeric slices.

Use Slicing on Rows and Columns

In [36]:
# Answer Here
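# Column slices follow the column order (Candidate, Party, %, Year, Result), so the start label must come first
sliced_data = elections.loc[1:5, 'Party':'Year']
sliced_data

Unnamed: 0,Party,%,Year
1,Democratic,41.0,1980
2,Independent,6.6,1980
3,Republican,58.8,1984
4,Democratic,37.6,1984
5,Republican,53.4,1988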


If we provide only a **single label** for the column argument, we get back a **Series**.

In [41]:
# Answer Here
elections.loc[0:4, 'Candidate']

0      Reagan
1      Carter
2    Anderson
3      Reagan
4     Mondale
Name: Candidate, dtype: object

If we want a data frame instead and don't want to use to_frame, we can provide a **list** containing the column name.

In [44]:
# Answer Here
elections.loc[0:4, ['Candidate']]

Unnamed: 0,Candidate
0,Reagan
1,Carter
2,Anderson
3,Reagan
4,Mondale


If we give only one row but many column labels, we'll get back a **Series** corresponding to a row of the table. This new Series has a neat index, where **each entry is the name of the column** that the data came from.

In [64]:
# Answer Here
elections.loc[0,['Year','Party','Candidate']]

Year               1980
Party        Republican
Candidate        Reagan
Name: 0, dtype: object

In [67]:
# Answer Here
elections.loc[0:2,['Year','Party','Candidate']]

Unnamed: 0,Year,Party,Candidate
0,1980,Republican,Reagan
1,1980,Democratic,Carter
2,1980,Independent,Anderson


If we omit the column argument altogether, the **default behavior is to retrieve all columns**. Likewise, passing `:` as the row argument retrieves all rows, as in the cell below.

In [68]:
# Answer Here
elections.loc[:,['Year','Party','Candidate']]

Unnamed: 0,Year,Party,Candidate
0,1980,Republican,Reagan
1,1980,Democratic,Carter
2,1980,Independent,Anderson
3,1984,Republican,Reagan
4,1984,Democratic,Mondale
5,1988,Republican,Bush
6,1988,Democratic,Dukakis
7,1992,Democratic,Clinton
8,1992,Republican,Bush
9,1992,Independent,Perot


Specify rows and columns as lists to retrieve specific entries.

In [69]:
# Answer Here
elections.loc[[0, 1, 2, 3],['Year','Party','Candidate']]

Unnamed: 0,Year,Party,Candidate
0,1980,Republican,Reagan
1,1980,Democratic,Carter
2,1980,Independent,Anderson
3,1984,Republican,Reagan


Boolean Series are also boolean arrays, so we can use the Boolean Array Selection from earlier using loc as well.

In [74]:
# Answer Here
a=elections['Year']==2012
elections.loc[a]

Unnamed: 0,Candidate,Party,%,Year,Result
19,Obama,Democratic,51.1,2012,win
20,Romney,Republican,47.2,2012,loss


## String-labeled Rows

Let's do a quick example using data with string-labeled rows instead of integer labeled rows, just to make sure we're really understanding loc.

Use mottos.csv file

In [81]:
# Answer Here
mottos=pd.read_csv(r'C:\Users\MOEED\Desktop\Knowledge Streams\Pandas\mottos.csv')  # raw string so backslashes are not read as escapes
mottos

Unnamed: 0,State,Motto,Translation,Language,Date Adopted
0,Alabama,Audemus jura nostra defendere,We dare defend our rights!,Latin,1923
1,Alaska,North to the future,—,English,1967
2,Arizona,Ditat Deus,God enriches,Latin,1863
3,Arkansas,Regnat populus,The people rule,Latin,1907
4,California,Eureka (Εὕρηκα),I have found it,Greek,1849
5,Colorado,Nil sine numine,Nothing without providence.,Latin,"November 6, 1861"
6,Connecticut,Qui transtulit sustinet,He who transplanted sustains,Latin,"October 9, 1662"
7,Delaware,Liberty and Independence,—,English,1847
8,Florida,In God We Trust,—,English,1868
9,Georgia,"Wisdom, Justice, Moderation",—,English,1798


A slice can be extracted with `loc` using slice notation, even if the rows have string labels instead of integer labels.
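A hedged sketch of label-based slicing with string labels. It assumes the `State` column is made the index (via `set_index`), which the cell above does not do, and it uses a four-row sample in place of the full file:

```python
import pandas as pd

# A few rows of the mottos data, re-indexed by state name
mottos_sample = pd.DataFrame({
    "State": ["Alabama", "Alaska", "Arizona", "Arkansas"],
    "Motto": ["Audemus jura nostra defendere", "North to the future",
              "Ditat Deus", "Regnat populus"],
    "Language": ["Latin", "English", "Latin", "Latin"],
}).set_index("State")

# Both ends of a loc slice are included, for rows and for columns
piece = mottos_sample.loc["Alaska":"Arizona", "Motto":"Language"]
print(piece)
```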

### iloc

loc's cousin iloc is very similar, but is used to access based on numerical position instead of label. For example, to access the first 3 rows and first 3 columns of a table, we can use `iloc[0:3, 0:3]`. `iloc` slicing is **exclusive**, just like standard Python slicing of numerical values.
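The difference in slice endpoints can be checked directly. This is a small sketch on a sample frame with the default integer labels:

```python
import pandas as pd

sample = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson", "Reagan", "Mondale"],
    "Party": ["Republican", "Democratic", "Independent", "Republican", "Democratic"],
    "Year": [1980, 1980, 1980, 1984, 1984],
})

by_label = sample.loc[0:3]        # labels 0, 1, 2 and 3 (end included)
by_position = sample.iloc[0:3]    # positions 0, 1 and 2 (end excluded)
corner = sample.iloc[0:3, 0:2]    # first 3 rows, first 2 columns

print(len(by_label), len(by_position), corner.shape)
```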

Use iloc to extract the first 3 rows and columns from the mottos DataFrame.

In [90]:
#Answer Here
mottos.iloc[0:3,0:3]

Unnamed: 0,State,Motto,Translation
0,Alabama,Audemus jura nostra defendere,We dare defend our rights!
1,Alaska,North to the future,—
2,Arizona,Ditat Deus,God enriches


We will use both `loc` and `iloc` in the course. `loc` is generally preferred for a number of reasons, for example:

1. It is harder to make mistakes since you have to literally write out what you want to get.
2. Code is easier to read, because the reader doesn't have to know e.g., what column #17 represents.
3. It is robust against permutations of the data, e.g. the social security administration switches the order of two columns.

However, iloc is sometimes more convenient. We'll provide examples of when iloc is the superior choice.
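One situation where `iloc` can be handier, offered here only as an illustrative sketch: after filtering, the remaining index labels no longer match row positions, so "the last row" is easiest to ask for by position.

```python
import pandas as pd

sample = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson", "Reagan", "Mondale"],
    "Year": [1980, 1980, 1980, 1984, 1984],
    "Result": ["win", "loss", "loss", "win", "loss"],
})

winners = sample[sample["Result"] == "win"]
print(list(winners.index))     # labels left over from the original frame

# Position -1 is the last remaining row, whatever its label happens to be
last_winner = winners.iloc[-1]
print(last_winner["Candidate"], last_winner["Year"])
```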

## Handy Properties and Utility Functions for Series and DataFrames

The `head` and `describe` methods and the `shape` and `size` attributes can be used to quickly get a good sense of the data we're working with. For example:

In [5]:
mottos = pd.read_csv("/content/drive/MyDrive/mottos.csv")

In [95]:
# Answer Here
mottos.head()

Unnamed: 0,State,Motto,Translation,Language,Date Adopted
0,Alabama,Audemus jura nostra defendere,We dare defend our rights!,Latin,1923
1,Alaska,North to the future,—,English,1967
2,Arizona,Ditat Deus,God enriches,Latin,1863
3,Arkansas,Regnat populus,The people rule,Latin,1907
4,California,Eureka (Εὕρηκα),I have found it,Greek,1849


Size of DataFrame

In [94]:
# Answer Here
mottos.size

250

A size of 250 means the table holds only 250 cells in total (rows times columns), so our data file is relatively small.

Shape of DataFrame

In [99]:
# Answer Here
mottos.shape

(50, 5)
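The size of 250 and the shape of (50, 5) reported above are related: size is the number of rows times the number of columns. A small sketch on an illustrative sample frame:

```python
import pandas as pd

sample = pd.DataFrame({
    "State": ["Alabama", "Alaska", "Arizona"],
    "Language": ["Latin", "English", "Latin"],
})

n_rows, n_cols = sample.shape   # shape is a (rows, columns) tuple
print(n_rows, n_cols)
print(sample.size)              # total number of cells
print(sample.size == n_rows * n_cols)
```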

Use the describe method to extract meaningful summary information from the DataFrame.

In [97]:
# Answer Here
mottos.describe()

Unnamed: 0,State,Motto,Translation,Language,Date Adopted
count,50,50,49,50,50
unique,50,50,30,8,47
top,Alabama,Audemus jura nostra defendere,—,Latin,1893
freq,1,1,20,23,2


Above, we see a quick summary of all the data. For example, the most common language for mottos is Latin, which covers 23 different states. Does anything else seem surprising?

We can get a direct reference to the index using .index.
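A minimal sketch of what `.index` returns, first with the default integer labels and then after choosing a column as the index (the sample frame and the `by_state` name are illustrative):

```python
import pandas as pd

sample = pd.DataFrame({
    "State": ["Alabama", "Alaska", "Arizona"],
    "Language": ["Latin", "English", "Latin"],
})

print(sample.index)             # default labels: a RangeIndex starting at 0

by_state = sample.set_index("State")
print(by_state.index)           # now the state names are the row labels
```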

In [None]:
# Answer Here

In [101]:
mottos.head(2)

Unnamed: 0,State,Motto,Translation,Language,Date Adopted
0,Alabama,Audemus jura nostra defendere,We dare defend our rights!,Latin,1923
1,Alaska,North to the future,—,English,1967


It turns out the columns also have an Index. We can access this index by using `.columns`.

In [18]:
# Answer Here
mottos.columns

Index(['State', 'Motto', 'Translation', 'Language', 'Date Adopted'], dtype='object')

## Sorting and Value Counts

There are also a ton of useful utility methods we can use with Data Frames and Series. For example, we can create a copy of a data frame sorted by a specific column using `sort_values`.

In [106]:
# Answer Here
# sort_values must be called, with the column to sort by as its argument
mottos.sort_values('Language')

As mentioned before, most Data Frame methods return a copy and do **not** modify the original data structure, unless you set `inplace` to True.
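This can be verified with a short sketch on a sample frame (names are illustrative): the sorted copy has a new row order, while the original keeps its own.

```python
import pandas as pd

sample = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson"],
    "%": [50.7, 41.0, 6.6],
})

sorted_sample = sample.sort_values("%")   # ascending by default

print(list(sorted_sample.index))   # row labels in sorted order
print(list(sample.index))          # the original order is unchanged
```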

If we want to sort in reverse order, we can set `ascending=False`.

In [102]:
elections.sort_values('%', ascending=False)

Unnamed: 0,Candidate,Party,%,Year,Result
3,Reagan,Republican,58.8,1984,win
5,Bush,Republican,53.4,1988,win
17,Obama,Democratic,52.9,2008,win
19,Obama,Democratic,51.1,2012,win
0,Reagan,Republican,50.7,1980,win
16,Bush,Republican,50.7,2004,win
10,Clinton,Democratic,49.2,1996,win
13,Gore,Democratic,48.4,2000,loss
15,Kerry,Democratic,48.3,2004,loss
21,Clinton,Democratic,48.2,2016,loss


We can also use `sort_values` on Series objects.

In [107]:
mottos['Language'].sort_values().head(50)

46    Chinook Jargon
49           English
29           English
28           English
27           English
26           English
48           English
37           English
38           English
40           English
17           English
34           English
42           English
14           English
41           English
12           English
1            English
13           English
8            English
7            English
9            English
43           English
22            French
4              Greek
10          Hawaiian
19           Italian
39             Latin
44             Latin
36             Latin
45             Latin
47             Latin
35             Latin
33             Latin
0              Latin
31             Latin
30             Latin
23             Latin
21             Latin
20             Latin
18             Latin
16             Latin
15             Latin
11             Latin
6              Latin
5              Latin
3              Latin
2              Latin
32           

For Series, the `value_counts` method is often quite handy.

In [109]:
mottos['Language'].value_counts()

Language
Latin             23
English           21
Greek              1
Hawaiian           1
Italian            1
French             1
Spanish            1
Chinook Jargon     1
Name: count, dtype: int64

Also commonly used is the `unique` method, which returns **all unique values** as a numpy array.

In [110]:
mottos['Language'].unique()

array(['Latin', 'English', 'Greek', 'Hawaiian', 'Italian', 'French',
       'Spanish', 'Chinook Jargon'], dtype=object)

In [115]:
def fiba(n):
    if n < 2:
        return n
    else:
        return fiba(n-1) + fiba(n-2)



fiba(4)

3

# Thank you!