# Chapter 1: World of Datasets

## Author: Audrey Marthin

### Sources: 
- https://pandas.pydata.org/docs/ 
- https://ourworldindata.org/
- Professor Wirfs-Brock's Class Code Demo
- Audrey's projects

Date: Saturday, December 9

### Part 1: Intro to Pandas
Pandas is a Python library for working with structured data. Its core data structure, the DataFrame, is a two-dimensional labeled table, similar to a table in a database or an Excel spreadsheet.
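
To make the idea of a DataFrame concrete before loading real data, here is a minimal sketch that builds one by hand. The column names and values are hypothetical toy data, not taken from the CSV files used in this chapter.

```python
import pandas as pd

# Hypothetical toy data: each dictionary key becomes a column
toy = pd.DataFrame({
    "Entity": ["A", "A", "B"],
    "Year": [2021, 2022, 2022],
    "Value": [10, 20, 30],
})

# .shape is (number of rows, number of columns)
print(toy.shape)

# .columns holds the column labels
print(toy.columns.tolist())
```

Every row gets an integer index label by default (0, 1, 2, ...), which is the leftmost column in the table outputs that follow.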

In [158]:
# Import packages 
import numpy as np
import pandas as pd

In [159]:
# Load the data in the form of a CSV file

# Import "private-investment-in-artificial-intelligence-by-focus-area.csv" and save it as a dataframe
# Note: make sure CSV file is in the correct file path
df_ai_investment = pd.read_csv("data/private-investment-in-artificial-intelligence-by-focus-area.csv")

In [160]:
# Examine dataframe
df_ai_investment

Unnamed: 0,Entity,Code,Year,World,European Union and United Kingdom,China,United States
0,AI ventures,,2017,705626292,34720055,233168581,404163765
1,AI ventures,,2018,1539250299,30321937,591347599,857474812
2,AI ventures,,2019,1212544984,15388467,93042361,962517568
3,AI ventures,,2020,1545821073,5170265,91954563,1358836062
4,AI ventures,,2021,2379370717,1013359242,258977775,923024967
...,...,...,...,...,...,...,...
155,Venture capital,,2018,217600809,0,107910146,109690663
156,Venture capital,,2019,106903067,26497349,0,28617137
157,Venture capital,,2020,179347432,0,124590452,4502008
158,Venture capital,,2021,225784599,0,0,60011724


We can observe that there are 7 columns and 160 rows, each row representing private investment in artificial intelligence for one entity in one year.

In [161]:
# Display the first 5 rows (the default for .head())
df_ai_investment.head()


Unnamed: 0,Entity,Code,Year,World,European Union and United Kingdom,China,United States
0,AI ventures,,2017,705626292,34720055,233168581,404163765
1,AI ventures,,2018,1539250299,30321937,591347599,857474812
2,AI ventures,,2019,1212544984,15388467,93042361,962517568
3,AI ventures,,2020,1545821073,5170265,91954563,1358836062
4,AI ventures,,2021,2379370717,1013359242,258977775,923024967


In [162]:
# Displays the first 10 rows
df_ai_investment.head(10)

Unnamed: 0,Entity,Code,Year,World,European Union and United Kingdom,China,United States
0,AI ventures,,2017,705626292,34720055,233168581,404163765
1,AI ventures,,2018,1539250299,30321937,591347599,857474812
2,AI ventures,,2019,1212544984,15388467,93042361,962517568
3,AI ventures,,2020,1545821073,5170265,91954563,1358836062
4,AI ventures,,2021,2379370717,1013359242,258977775,923024967
5,AI ventures,,2022,1242838005,14814431,451988999,641729703
6,Agricultural technology,,2017,309478503,7554717,10391318,261846503
7,Agricultural technology,,2018,280024054,33194058,0,194254290
8,Agricultural technology,,2019,502672339,116117865,18018197,283437548
9,Agricultural technology,,2020,686735808,25459779,7959031,624487776


In [163]:
# Get dataframe summary
df_ai_investment.describe()

Unnamed: 0,Code,Year,World,European Union and United Kingdom,China,United States
count,0.0,160.0,160.0,160.0,160.0,160.0
mean,,2019.375,4774578000.0,469283700.0,1106503000.0,2637808000.0
std,,1.872509,14142470000.0,1459440000.0,3012095000.0,7897088000.0
min,,2013.0,14814430.0,0.0,0.0,0.0
25%,,2018.0,636550900.0,32701040.0,42766930.0,282865100.0
50%,,2019.0,1497970000.0,94379970.0,234287100.0,663885900.0
75%,,2021.0,3361576000.0,259983900.0,846222900.0,1797541000.0
max,,2022.0,125356900000.0,12500780000.0,22848300000.0,73395940000.0


In [164]:
# Let's select the single column "United States"
df_ai_investment["United States"]


0       404163765
1       857474812
2       962517568
3      1358836062
4       923024967
          ...    
155     109690663
156      28617137
157       4502008
158      60011724
159             0
Name: United States, Length: 160, dtype: int64

In [165]:
# Select the "AI ventures" row for 2018, which is the 2nd row (position 1, since positions start at 0)
df_ai_investment.iloc[1]

Entity                               AI ventures
Code                                         NaN
Year                                        2018
World                                 1539250299
European Union and United Kingdom       30321937
China                                  591347599
United States                          857474812
Name: 1, dtype: object

In [166]:
# Select the value with index label 5 in the "United States" column (AI ventures, 2022)
df_ai_investment["United States"][5]

641729703

In [167]:
# Get the "United States" value of the 5th row by position (.iloc counts from 0)
df_ai_investment[ "United States" ].iloc[4]

923024967
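
The two lookups above happen to agree with the row order because the index labels are still 0, 1, 2, and so on. A small sketch on hypothetical toy data shows how label-based and position-based selection diverge once rows have been filtered out:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"Year": [2020, 2021, 2022], "Value": [10, 20, 30]})

# Filtering keeps the original index labels (here 1 and 2)
subset = toy[toy["Year"] >= 2021]

# .loc selects by index label: the row labeled 2
by_label = subset.loc[2, "Value"]

# .iloc selects by position: the first row of the subset, whatever its label
by_position = subset["Value"].iloc[0]

print(by_label, by_position)
```

Using `.loc` for labels and `.iloc` for positions makes the intent explicit, which helps once a dataframe has been sorted or filtered.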

In [168]:
# Type of the first value in each column
for col in df_ai_investment:
    print(col)
    print(type(df_ai_investment[col][0]))

Entity
<class 'str'>
Code
<class 'numpy.float64'>
Year
<class 'numpy.int64'>
World
<class 'numpy.int64'>
European Union and United Kingdom
<class 'numpy.int64'>
China
<class 'numpy.int64'>
United States
<class 'numpy.int64'>


In [169]:
# Can also use .dtypes
df_ai_investment.dtypes

Entity                                object
Code                                 float64
Year                                   int64
World                                  int64
European Union and United Kingdom      int64
China                                  int64
United States                          int64
dtype: object

In [170]:
# We can see that the column "Code" has NaN values
# Use .isna() to formally check
df_ai_investment["Code"].isna()

0      True
1      True
2      True
3      True
4      True
       ... 
155    True
156    True
157    True
158    True
159    True
Name: Code, Length: 160, dtype: bool
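
Reading a long True/False column by eye is error-prone. One possible way to summarize it, sketched here on hypothetical toy data, is to count the missing values per column and to check whether a column is missing everywhere:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values
toy = pd.DataFrame({
    "Code": [np.nan, np.nan, np.nan],
    "Value": [1.0, np.nan, 3.0],
})

# True counts as 1 when summed, so this is the number of NaN values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# .all() is True only if every value in the column is NaN
code_all_missing = toy["Code"].isna().all()
print(code_all_missing)
```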

In [171]:
# Fill NaN values with a certain value
df_ai_investment["Code"] = df_ai_investment["Code"].fillna("N")
df_ai_investment

Unnamed: 0,Entity,Code,Year,World,European Union and United Kingdom,China,United States
0,AI ventures,N,2017,705626292,34720055,233168581,404163765
1,AI ventures,N,2018,1539250299,30321937,591347599,857474812
2,AI ventures,N,2019,1212544984,15388467,93042361,962517568
3,AI ventures,N,2020,1545821073,5170265,91954563,1358836062
4,AI ventures,N,2021,2379370717,1013359242,258977775,923024967
...,...,...,...,...,...,...,...
155,Venture capital,N,2018,217600809,0,107910146,109690663
156,Venture capital,N,2019,106903067,26497349,0,28617137
157,Venture capital,N,2020,179347432,0,124590452,4502008
158,Venture capital,N,2021,225784599,0,0,60011724


In [172]:
# Remove the "Code" column
# Select a specific subset of columns by passing a list of column names
col_wanted = ["Entity", "Year", "World", "European Union and United Kingdom", "China", "United States"]
# .copy() makes an independent dataframe, so columns can be added to it later without a SettingWithCopyWarning
df_ai_investment_new = df_ai_investment[col_wanted].copy()
df_ai_investment_new

Unnamed: 0,Entity,Year,World,European Union and United Kingdom,China,United States
0,AI ventures,2017,705626292,34720055,233168581,404163765
1,AI ventures,2018,1539250299,30321937,591347599,857474812
2,AI ventures,2019,1212544984,15388467,93042361,962517568
3,AI ventures,2020,1545821073,5170265,91954563,1358836062
4,AI ventures,2021,2379370717,1013359242,258977775,923024967
...,...,...,...,...,...,...
155,Venture capital,2018,217600809,0,107910146,109690663
156,Venture capital,2019,106903067,26497349,0,28617137
157,Venture capital,2020,179347432,0,124590452,4502008
158,Venture capital,2021,225784599,0,0,60011724
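
Listing the columns to keep works well when only a few are wanted. When only one column needs to go, naming the column to drop can be shorter. A minimal sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Entity": ["A", "B"],
    "Code": [None, None],
    "Year": [2021, 2022],
})

# drop(columns=...) returns a new dataframe without the listed columns
trimmed = toy.drop(columns=["Code"])

print(trimmed.columns.tolist())
print(toy.columns.tolist())  # the original dataframe is unchanged
```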


In [173]:
# We can count how many rows each private AI investment entity has in the dataframe
df_ai_investment_new["Entity"].value_counts()


Entity
Total                                            10
AI ventures                                       6
Agricultural technology                           6
Semiconductors                                    6
Sales enablement                                  6
Retail                                            6
Natural Language Processing, customer support     6
Music and video content                           6
Medical and healthcare                            6
Marketing and digital ads                         6
Legal technology                                  6
Insurance technology                              6
Industrial automation                             6
Human Resources technology                        6
Geospatial                                        6
Fitness and wellness                              6
Financial technology                              6
Facial recognition                                6
Entertainment                                     6
Energ

In [174]:
# We can get the unique entity names
entity = df_ai_investment_new["Entity"].unique()
print(entity)

# And get the number of entities present
print(len(entity))

# We can also get the number of unique entities this way
n_entity = df_ai_investment_new["Entity"].nunique()
n_entity

['AI ventures' 'Agricultural technology' 'Augmented or virtual reality'
 'Cybersecurity' 'Data management' 'Drones' 'Educational technology'
 'Energy, oil and gas' 'Entertainment' 'Facial recognition'
 'Financial technology' 'Fitness and wellness' 'Geospatial'
 'Human Resources technology' 'Industrial automation'
 'Insurance technology' 'Legal technology' 'Marketing and digital ads'
 'Medical and healthcare' 'Music and video content'
 'Natural Language Processing, customer support' 'Retail'
 'Sales enablement' 'Semiconductors' 'Total' 'Venture capital']
26


26

### Part 2: Calculations and Grouping

#### Start with calculations of data

In [175]:
# Find each region's share of world investment
region = ["European Union and United Kingdom", "China", "United States"]

# Create a function that adds a "Share of <region>" column (in percent) for each region
# Note: this modifies df in place
def getShare(df, regions, world):
    for reg in regions:
        col_name = "Share of " + reg
        df[col_name] = df[reg]/df[world]*100

# Call function
getShare(df_ai_investment_new, region, "World")

In [176]:
# Examine
df_ai_investment_new

Unnamed: 0,Entity,Year,World,European Union and United Kingdom,China,United States,Share of European Union and United Kingdom,Share of China,Share of United States
0,AI ventures,2017,705626292,34720055,233168581,404163765,4.920459,33.044203,57.277311
1,AI ventures,2018,1539250299,30321937,591347599,857474812,1.969916,38.417897,55.707302
2,AI ventures,2019,1212544984,15388467,93042361,962517568,1.269105,7.673312,79.379947
3,AI ventures,2020,1545821073,5170265,91954563,1358836062,0.334467,5.948590,87.903839
4,AI ventures,2021,2379370717,1013359242,258977775,923024967,42.589380,10.884297,38.792819
...,...,...,...,...,...,...,...,...,...
155,Venture capital,2018,217600809,0,107910146,109690663,0.000000,49.590875,50.409125
156,Venture capital,2019,106903067,26497349,0,28617137,24.786332,0.000000,26.769239
157,Venture capital,2020,179347432,0,124590452,4502008,0.000000,69.468768,2.510216
158,Venture capital,2021,225784599,0,0,60011724,0.000000,0.000000,26.579193
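
The share calculation is plain column arithmetic: dividing one column by another operates row by row. As a sketch of an alternative design, the hypothetical helper `with_shares` below works on a copy and returns it, so the input dataframe is left untouched. Pandas can emit a SettingWithCopyWarning when columns are added to a dataframe that was itself selected from another one, and working on an explicit copy avoids that. The data is toy data chosen so the percentages are easy to check by hand.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "World": [200, 400],
    "China": [50, 100],
    "United States": [100, 300],
})

def with_shares(df, regions, world="World"):
    """Return a copy of df with a "Share of <region>" column (percent) per region."""
    out = df.copy()
    for reg in regions:
        # Element-wise division: each row's region value over that row's world value
        out["Share of " + reg] = out[reg] / out[world] * 100
    return out

shares = with_shares(toy, ["China", "United States"])
print(shares)
```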


In [177]:
# Dataframe for only the year 2022
# Do so with boolean masking
df_ai_investment_2022 = df_ai_investment_new[df_ai_investment_new["Year"] == 2022]
df_ai_investment_2022

Unnamed: 0,Entity,Year,World,European Union and United Kingdom,China,United States,Share of European Union and United Kingdom,Share of China,Share of United States
5,AI ventures,2022,1242838005,14814431,451988999,641729703,1.191984,36.367491,51.634219
11,Agricultural technology,2022,810137282,76739673,95402811,509921645,9.472428,11.776129,62.942622
17,Augmented or virtual reality,2022,2214330335,60016045,13332988,1912898173,2.710347,0.602123,86.387209
23,Cybersecurity,2022,4978670210,217439686,992985857,3578690067,4.367425,19.944801,71.88044
29,Data management,2022,5423724972,219221105,1727359990,2894398654,4.041892,31.848222,53.365513
35,Drones,2022,1745272701,39090208,25925254,1483572636,2.239777,1.485456,85.005205
41,Educational technology,2022,344161272,95099978,12027466,111497296,27.632388,3.494718,32.396817
47,"Energy, oil and gas",2022,1487149762,185258640,319406209,745179209,12.457295,21.477743,50.107879
53,Entertainment,2022,804408782,160331685,170547096,431472870,19.931618,21.201546,53.638508
59,Facial recognition,2022,62498395,0,0,62498395,0.0,0.0,100.0


In [178]:
# Get the maximum "United States" value among the 2022 rows
max_2022 = df_ai_investment_2022["United States"].max()

# Get the minimum "United States" value among the 2022 rows
min_2022 = df_ai_investment_2022["United States"].min()
print("The maximum entity value is {} and the minimum entity value is {}.".format(max_2022, min_2022))

The maximum entity value is 43848726664 and the minimum entity value is 0.


It seems the Total row is interfering with the result we want: the maximum comes from the Total row rather than from a single entity.

In [179]:
# Remove the "Total" rows, which is another use of boolean masking
df_ai_investment_new = df_ai_investment_new[df_ai_investment_new["Entity"] != "Total"]
df_ai_investment_2022 = df_ai_investment_2022[df_ai_investment_2022["Entity"] != "Total"]

# Get the maximum "United States" value among the 2022 rows
max_2022 = df_ai_investment_2022["United States"].max()

# Get the minimum "United States" value among the 2022 rows
min_2022 = df_ai_investment_2022["United States"].min()
print("The maximum entity value is {} and the minimum entity value is {}.".format(max_2022, min_2022))


The maximum entity value is 3883317553 and the minimum entity value is 0.
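
`.max()` returns the largest value but not the entity it belongs to. One way to recover the entity, sketched on hypothetical toy data, is to combine both conditions into a single mask with `&` and then use `.idxmax()` to get the index label of the largest value:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Entity": ["Drones", "Retail", "Total"],
    "Year": [2022, 2022, 2022],
    "United States": [30, 10, 40],
})

# Combine two conditions; each one needs its own parentheses
mask = (toy["Year"] == 2022) & (toy["Entity"] != "Total")
filtered = toy[mask]

# idxmax gives the index label of the largest value, and .loc fetches that row
top_row = filtered.loc[filtered["United States"].idxmax()]
top_entity = top_row["Entity"]

print(top_entity, top_row["United States"])
```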


#### Let's get to grouping data! 

In [180]:
# Import new data
df_ai_views = pd.read_csv("data/views-ai-impact-society-next-20-years.csv")
df_ai_views

Unnamed: 0,Entity,Code,Year,Mostly help,Neither,Refused - help/harm question,"Don't have an opinion, Don't know - help/harm question",Mostly harm
0,Afghanistan,AFG,2019,33.451640,2.129547,,40.638863,23.779947
1,Afghanistan,AFG,2021,42.600000,0.700000,,31.900000,24.800000
2,Albania,ALB,2019,17.870370,0.740741,0.092593,32.870370,48.425926
3,Albania,ALB,2021,21.700000,0.200000,0.100000,25.500000,52.500000
4,Algeria,DZA,2019,20.909090,1.000000,0.181818,23.909090,54.000000
...,...,...,...,...,...,...,...,...
233,Vietnam,VNM,2021,59.483616,4.369414,,22.244290,13.902681
234,Zambia,ZMB,2019,24.000000,1.100000,0.100000,34.800000,40.000000
235,Zambia,ZMB,2021,27.000000,1.500000,0.100000,33.500000,37.900000
236,Zimbabwe,ZWE,2019,48.521255,0.924214,,23.382626,27.171904


In [181]:
# Sort by "Mostly help" value
df_sort_help = df_ai_views.sort_values("Mostly help", ascending=False)
df_sort_help

Unnamed: 0,Entity,Code,Year,Mostly help,Neither,Refused - help/harm question,"Don't have an opinion, Don't know - help/harm question",Mostly harm
63,Finland,FIN,2021,72.266400,0.994036,,11.530815,15.208748
193,South Korea,KOR,2021,69.920320,1.095617,0.199203,14.143426,14.641435
192,South Korea,KOR,2019,66.931480,1.489573,,10.625621,20.953327
155,Norway,NOR,2021,63.800000,0.700000,,21.100000,14.400000
51,Denmark,DNK,2021,61.700000,0.200000,,22.100000,16.000000
...,...,...,...,...,...,...,...,...
148,Nicaragua,NIC,2019,14.444445,1.574074,1.111111,35.740740,47.129630
161,Paraguay,PRY,2021,14.185814,0.099900,0.199800,35.264736,50.249752
207,Tanzania,TZA,2021,11.700000,0.200000,0.100000,25.300000,62.700000
99,Jamaica,JAM,2021,11.485148,0.396040,3.564356,54.653465,29.900990


In [182]:
# Filter for rows with "Mostly help" > 60 percent
df_sort_help_top = df_sort_help[df_sort_help["Mostly help"] > 60] 

# Get how many occurrences there are across the 2 years
print("There's a total of {} recorded data where views are 'Mostly help'".format(len(df_sort_help_top)))

# See dataframe
df_sort_help_top

There's a total of 9 recorded data where views are 'Mostly help'


Unnamed: 0,Entity,Code,Year,Mostly help,Neither,Refused - help/harm question,"Don't have an opinion, Don't know - help/harm question",Mostly harm
63,Finland,FIN,2021,72.2664,0.994036,,11.530815,15.208748
193,South Korea,KOR,2021,69.92032,1.095617,0.199203,14.143426,14.641435
192,South Korea,KOR,2019,66.93148,1.489573,,10.625621,20.953327
155,Norway,NOR,2021,63.8,0.7,,21.1,14.4
51,Denmark,DNK,2021,61.7,0.2,,22.1,16.0
101,Japan,JPN,2021,61.48515,5.049505,,20.29703,13.168317
39,China,CHN,2021,61.0,4.314286,0.057143,21.742857,12.885715
199,Sweden,SWE,2021,60.73926,0.2997,,26.473526,12.487513
100,Japan,JPN,2019,60.474308,5.335968,,16.798418,17.391304
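
Filtering by a threshold and then sorting answers "which rows are above 60?". If the question is instead "which are the top N rows?", `nlargest` is one possible shortcut. A sketch on hypothetical toy data, where both approaches happen to select the same rows:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Entity": ["X", "Y", "Z", "W"],
    "Mostly help": [72.0, 45.5, 61.0, 58.2],
})

# Threshold filter, then sort from highest to lowest
above_60 = toy[toy["Mostly help"] > 60].sort_values("Mostly help", ascending=False)

# The 2 rows with the largest "Mostly help" values, already sorted
top_2 = toy.nlargest(2, "Mostly help")

print(above_60)
print(top_2)
```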


In [183]:
# Find which view category has the highest share
# Select the relevant columns for comparison
columns_to_compare = ["Mostly help", "Neither", "Refused - help/harm question", "Don't have an opinion, Don't know - help/harm question", "Mostly harm"]

# Create a new column "Highest View" holding the name of the column with the maximum value
# idxmax with axis=1 searches across the columns of each row
df_ai_views["Highest View"] = df_ai_views[columns_to_compare].idxmax(axis=1)
df_ai_views

Unnamed: 0,Entity,Code,Year,Mostly help,Neither,Refused - help/harm question,"Don't have an opinion, Don't know - help/harm question",Mostly harm,Highest View
0,Afghanistan,AFG,2019,33.451640,2.129547,,40.638863,23.779947,"Don't have an opinion, Don't know - help/harm ..."
1,Afghanistan,AFG,2021,42.600000,0.700000,,31.900000,24.800000,Mostly help
2,Albania,ALB,2019,17.870370,0.740741,0.092593,32.870370,48.425926,Mostly harm
3,Albania,ALB,2021,21.700000,0.200000,0.100000,25.500000,52.500000,Mostly harm
4,Algeria,DZA,2019,20.909090,1.000000,0.181818,23.909090,54.000000,Mostly harm
...,...,...,...,...,...,...,...,...,...
233,Vietnam,VNM,2021,59.483616,4.369414,,22.244290,13.902681,Mostly help
234,Zambia,ZMB,2019,24.000000,1.100000,0.100000,34.800000,40.000000,Mostly harm
235,Zambia,ZMB,2021,27.000000,1.500000,0.100000,33.500000,37.900000,Mostly harm
236,Zimbabwe,ZWE,2019,48.521255,0.924214,,23.382626,27.171904,Mostly help
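
Some rows contain NaN, for example in the "Refused" column. By default `idxmax` skips missing values, so a NaN does not prevent a row from getting a result. A small sketch on hypothetical toy data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the first row has a missing value
toy = pd.DataFrame({
    "Mostly help": [40.0, 20.0],
    "Neither": [np.nan, 5.0],
    "Mostly harm": [30.0, 60.0],
})

columns_to_compare = ["Mostly help", "Neither", "Mostly harm"]

# For each row, the name of the column holding the largest non-missing value
toy["Highest View"] = toy[columns_to_compare].idxmax(axis=1)

print(toy["Highest View"].tolist())
```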


In [184]:
# How can we look at each year's total count of each view category?
# Use groupby to group by the year, then use value_counts() on "Highest View"

df_all_views = df_ai_views.groupby("Year")["Highest View"].value_counts()
df_all_views # result is a series

Year  Highest View                                          
2019  Mostly harm                                               66
      Mostly help                                               42
      Don't have an opinion, Don't know - help/harm question    11
2021  Mostly help                                               52
      Mostly harm                                               49
      Don't have an opinion, Don't know - help/harm question    18
Name: count, dtype: int64

In [185]:
# Look at year 2021 only
df_all_views[2021]

Highest View
Mostly help                                               52
Mostly harm                                               49
Don't have an opinion, Don't know - help/harm question    18
Name: count, dtype: int64

In [186]:
# We can also do the same thing using .agg(), a method that applies aggregation operations
# Group by "Year" and "Highest View" and count the rows in each group
df_all_views_2 = df_ai_views.groupby(["Year","Highest View"]).agg(view_count = ("Highest View", "count"))
df_all_views_2 # result is a dataframe

Year,Highest View,view_count
2019,"Don't have an opinion, Don't know - help/harm question",11
2019,Mostly harm,66
2019,Mostly help,42
2021,"Don't have an opinion, Don't know - help/harm question",18
2021,Mostly harm,49
2021,Mostly help,52


Let's learn about pivot tables now!

In [187]:
# Create a pivot table to count occurrences of each category in "Highest View" for each year
df_all_views_3 = df_ai_views.pivot_table(values="Entity", index="Year", columns="Highest View", aggfunc="count", fill_value=0)
df_all_views_3 # result is dataframe

Year,"Don't have an opinion, Don't know - help/harm question",Mostly harm,Mostly help
2019,11,66,42
2021,18,49,52


In [188]:
# Use .loc to select a row
df_all_views_3.loc[2019]

Highest View
Don't have an opinion, Don't know - help/harm question    11
Mostly harm                                               66
Mostly help                                               42
Name: 2019, dtype: int64

#### Now, we're equipped with the basics of working with datasets! 