# Subject: Data Science Foundation

## Session 14 - ArcGIS API for Python.

### Exercise 2 - Descriptive Statistics using an HTML table to Pandas Data Frame to Portal Item

Let us read the Wikipedia article "List of countries by cigarette consumption per capita",
which lists countries by annual per capita consumption of tobacco cigarettes.
We will explore the dataframe (descriptive statistics and correlation) and create a map.

https://en.wikipedia.org/wiki/List_of_countries_by_cigarette_consumption_per_capita

In [1]:
import pandas as pd

In [2]:
df = pd.read_html("https://en.wikipedia.org/wiki/List_of_countries_by_cigarette_consumption_per_capita")[0]

In [3]:
df.head()

Unnamed: 0,0,1,2
0,Ranking,Country/Territory,Number of cigarettes per person aged ≥ 15 per ...
1,1,Montenegro,4124.53
2,2,Belarus,3831.62
3,3,Lebanon,3023.15
4,4,Macedonia,2732.23


In [4]:
df.columns = df.iloc[0]
df = df.reindex(df.index.drop(0))
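
The header promotion in the cell above can be sketched on a small hand-built frame, so it runs without fetching the page. The shortened column label `Cigarettes` and the names `raw` and `promoted` are illustrative assumptions, not part of the exercise. Assigning a plain list, rather than the row itself, is one way to avoid the stray `0` label that appears above the column names in later output such as `df.dtypes`.

```python
import pandas as pd

# Hand-built stand-in for the raw table: the real header sits in row 0.
raw = pd.DataFrame(
    [
        ["Ranking", "Country/Territory", "Cigarettes"],
        ["1", "Montenegro", "4124.53"],
        ["2", "Belarus", "3831.62"],
    ]
)

# Keep only the data rows, then use the values of row 0 as column labels.
promoted = raw.iloc[1:].copy()
promoted.columns = raw.iloc[0].tolist()  # a plain list leaves columns.name unset

print(promoted.columns.tolist())
print(promoted.shape)
```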

In [5]:
df.head()

Unnamed: 0,Ranking,Country/Territory,Number of cigarettes per person aged ≥ 15 per year[7]
1,1,Montenegro,4124.53
2,2,Belarus,3831.62
3,3,Lebanon,3023.15
4,4,Macedonia,2732.23
5,5,Russia,2690.33


Let's check the data structure

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 182 entries, 1 to 182
Data columns (total 3 columns):
Ranking                                                  182 non-null object
Country/Territory                                        182 non-null object
Number of cigarettes per person aged ≥ 15 per year[7]    182 non-null object
dtypes: object(3)
memory usage: 5.7+ KB


In [7]:
df.shape

(182, 3)

Let's find the ranking position of our country

In [8]:
df.loc[df['Country/Territory'] == "Spain"]

Unnamed: 0,Ranking,Country/Territory,Number of cigarettes per person aged ≥ 15 per year[7]
47,47,Spain,1264.74


In [9]:
df.loc[df['Country/Territory'] == "China"]

Unnamed: 0,Ranking,Country/Territory,Number of cigarettes per person aged ≥ 15 per year[7]
9,9,China,2249.79


In [10]:
df["Country/Territory"].rank  # no parentheses: this displays the bound method, it does not compute ranks

<bound method NDFrame.rank of 1                            Montenegro
2                               Belarus
3                               Lebanon
4                             Macedonia
5                                Russia
6                              Slovenia
7                               Belgium
8                            Luxembourg
9                                 China
10               Bosnia and Herzegovina
11                       Czech Republic
12                           Kazakhstan
13                           Azerbaijan
14                               Greece
15                          South Korea
16                              Austria
17                               Jordan
18                              Ukraine
19                              Hungary
20                              Estonia
21                                Japan
22                              Croatia
23                               Serbia
24                               Cyprus
25        
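
The output above is the representation of the `rank` method itself, because the method was referenced but not called. A minimal sketch of calling it, assuming the goal is to rank countries by consumption: the three rows are taken from the table, while the frame `sample` and the numeric column built by hand here are illustrative.

```python
import pandas as pd

# Three rows from the table, deliberately out of order.
sample = pd.DataFrame(
    {
        "Country/Territory": ["Belarus", "Montenegro", "Lebanon"],
        "Nrcigar_pp": [3831.62, 4124.53, 3023.15],
    }
)

# Without parentheses, .rank is only a reference to the method.
method_ref = sample["Nrcigar_pp"].rank
print(callable(method_ref))

# Calling it computes the ranks; ascending=False gives rank 1 to the largest value.
sample["rank"] = sample["Nrcigar_pp"].rank(ascending=False).astype(int)
print(sample.sort_values("rank"))
```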

Let's check the descriptive statistics

In [11]:
df.describe()

Unnamed: 0,Ranking,Country/Territory,Number of cigarettes per person aged ≥ 15 per year[7]
count,182,182,182.0
unique,182,182,182.0
top,3,Denmark,1024.09
freq,1,1,1.0
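
The summary above shows `count`, `unique`, `top` and `freq` because every column still has the `object` dtype reported by `df.info()`. A small sketch of the difference once the values are numeric, using four values copied from the table (the variable names are illustrative):

```python
import pandas as pd

# Values stored as text, matching the object dtype of the scraped column.
as_text = pd.Series(["4124.53", "3831.62", "3023.15", "2732.23"])

text_summary = as_text.describe()                   # count, unique, top, freq
number_summary = pd.to_numeric(as_text).describe()  # count, mean, std, min, quartiles, max

print(text_summary.index.tolist())
print(number_summary.index.tolist())
```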


Let's rename the columns to prepare the data for a correlation analysis and also for mapping

In [12]:
# rename() returns a new DataFrame; without an assignment, df itself keeps its original column names
df.rename(columns={'Country/Territory':'Country','Number of cigarettes per person aged ≥ 15 per year[7]':'Cigarettes consumption PP'})

Unnamed: 0,Ranking,Country,Cigarettes consumption PP
1,1,Montenegro,4124.53
2,2,Belarus,3831.62
3,3,Lebanon,3023.15
4,4,Macedonia,2732.23
5,5,Russia,2690.33
6,6,Slovenia,2637.03
7,7,Belgium,2353.28
8,8,Luxembourg,2283.
9,9,China,2249.79
10,10,Bosnia and Herzegovina,2233.46
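
Note that `rename` returns a new DataFrame, which is why later cells still show the original column names. A minimal sketch of that behavior, with a two-row frame built from the Spain and China rows shown earlier (`sample` and `renamed` are illustrative names):

```python
import pandas as pd

sample = pd.DataFrame({"Country/Territory": ["Spain", "China"], "Ranking": ["47", "9"]})

# rename() leaves the original untouched and returns a relabelled copy.
renamed = sample.rename(columns={"Country/Territory": "Country"})

print(sample.columns.tolist())
print(renamed.columns.tolist())
```

To keep the new names, the result would have to be assigned back, for example `df = df.rename(columns=...)`; the later cells in this exercise would then need to use the new names as well.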


We need the "Number of cigarettes per person aged ≥ 15 per year[7]" column in numeric format, stored as a new column (Nrcigar_pp). Let us convert it and, while doing so, turn any value that cannot be parsed into NaN, which stands for Not a Number.

In [13]:
# errors='coerce' replaces any value that cannot be parsed as a number with NaN
converted_NOci = pd.to_numeric(df["Number of cigarettes per person aged ≥ 15 per year[7]"], errors='coerce')
df['Nrcigar_pp'] = converted_NOci
df.head()

Unnamed: 0,Ranking,Country/Territory,Number of cigarettes per person aged ≥ 15 per year[7],Nrcigar_pp
1,1,Montenegro,4124.53,4124.53
2,2,Belarus,3831.62,3831.62
3,3,Lebanon,3023.15,3023.15
4,4,Macedonia,2732.23,2732.23
5,5,Russia,2690.33,2690.33
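
The effect of `errors='coerce'` is easier to see on a series that actually contains a bad value. In this sketch the string `"n/a"` is a made-up entry for illustration; the other two values come from the table.

```python
import pandas as pd

# "n/a" is a hypothetical unparseable entry; the numbers are from the table.
values = pd.Series(["4124.53", "n/a", "1264.74"])

# Unparseable entries become NaN instead of raising an error.
converted = pd.to_numeric(values, errors="coerce")

print(converted)
print(converted.isna().sum())
```

Counting the NaN values afterwards, as in the last line, is a quick way to check how many entries were lost in the conversion.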


Repeat for the "Ranking" column

In [14]:
converted_Rank = pd.to_numeric(df["Ranking"], errors = 'coerce') 
df['Nr_Rank'] = converted_Rank
df.head()

Unnamed: 0,Ranking,Country/Territory,Number of cigarettes per person aged ≥ 15 per year[7],Nrcigar_pp,Nr_Rank
1,1,Montenegro,4124.53,4124.53,1
2,2,Belarus,3831.62,3831.62,2
3,3,Lebanon,3023.15,3023.15,3
4,4,Macedonia,2732.23,2732.23,4
5,5,Russia,2690.33,2690.33,5


Let's calculate the correlation

In [15]:
df.dtypes

0
Ranking                                                   object
Country/Territory                                         object
Number of cigarettes per person aged ≥ 15 per year[7]     object
Nrcigar_pp                                               float64
Nr_Rank                                                    int64
dtype: object

In [16]:
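# Select the numeric columns explicitly; recent pandas versions raise an error if object columns are passed to corr()
df[['Nrcigar_pp', 'Nr_Rank']].corr(method='pearson')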

Unnamed: 0,Nrcigar_pp,Nr_Rank
Nrcigar_pp,1.0,-0.91051
Nr_Rank,-0.91051,1.0
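
A strong negative coefficient is expected here, since the ranking is derived from the consumption values: rank 1 belongs to the highest consumption. The sketch below, using the first five rows of the table, compares Pearson with Spearman (a rank-based coefficient, added here only as a point of comparison and not part of the original exercise). Because the relationship is monotonic but not linear, Spearman reaches -1 while Pearson stays above it.

```python
import pandas as pd

# First five rows of the table.
sample = pd.DataFrame(
    {
        "Country/Territory": ["Montenegro", "Belarus", "Lebanon", "Macedonia", "Russia"],
        "Nrcigar_pp": [4124.53, 3831.62, 3023.15, 2732.23, 2690.33],
        "Nr_Rank": [1, 2, 3, 4, 5],
    }
)

# Restrict to the numeric columns before computing correlations.
numeric = sample[["Nrcigar_pp", "Nr_Rank"]]

pearson = numeric.corr(method="pearson").loc["Nrcigar_pp", "Nr_Rank"]
spearman = numeric.corr(method="spearman").loc["Nrcigar_pp", "Nr_Rank"]

print(pearson, spearman)
```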


## Plot as a map

Let us connect to our GIS to geocode this data and present it as a map

In [17]:
from arcgis.gis import GIS
from getpass import getpass
import json

# Prompt for the password instead of hard-coding it in the notebook
gis = GIS("https://www.arcgis.com", "christynotsoso", getpass())

## Remark: the following could not be run because the trial subscription has expired

In [18]:
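# 'Country/Territory' is the actual column name, because the rename in cell 12 was never assigned back to df
fc = gis.content.import_data(df, {"CountryCode":"Country/Territory"})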

Subscription is disabled


RuntimeError: Subscription is disabled
(Error Code: 403)

Let us use smart mapping to render the points with varying sizes representing the number of cigarettes per person aged ≥ 15 per year

> Put your code here

Let us publish this layer as a feature collection item in our GIS

> Put your code here
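
Since the subscription error prevents this step from running, here is only a sketch of the part that needs no connection: assembling the item properties. It assumes the feature collection exposes a dict-like layer definition (for example `dict(fc.layer)`) and that the item is added with the older `gis.content.add` call, which may differ between versions of the ArcGIS API for Python. The helper `build_item_properties`, the placeholder `demo_layer`, and the title and tags are all hypothetical.

```python
import json


def build_item_properties(layer, title, tags):
    """Assemble the properties dict for a 'Feature Collection' item.

    `layer` is assumed to be the dict-like definition of one layer,
    for example dict(fc.layer) once import_data has succeeded.
    """
    return {
        "title": title,
        "tags": tags,
        "type": "Feature Collection",
        # The layer definition is embedded as a JSON string.
        "text": json.dumps({"featureCollection": {"layers": [layer]}}),
    }


# Placeholder layer definition so the sketch runs without a GIS connection.
demo_layer = {"layerDefinition": {"name": "cigarettes"}, "featureSet": {"features": []}}
item_properties = build_item_properties(
    demo_layer, "Cigarette consumption per capita", "cigarettes,health"
)

# With a working subscription the item would then be added, for example:
# item = gis.content.add(item_properties)
print(item_properties["type"])
```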

Let us search for this item

> Put your code here