# Subject: Data Science Foundation

## Session 14 - ArcGIS API for Python.

### Exercise 2 - Descriptive Statistics: from an HTML Table to a Pandas DataFrame to a Portal Item

Let us read the Wikipedia article "List of countries by cigarette consumption per capita".
It lists countries by annual per capita consumption of tobacco cigarettes.
We will explore the dataframe (descriptive statistics and correlation) and create a map.

https://en.wikipedia.org/wiki/List_of_countries_by_cigarette_consumption_per_capita

In [1]:
import pandas as pd

In [29]:
df = pd.read_html("https://en.wikipedia.org/wiki/List_of_countries_by_cigarette_consumption_per_capita")[0]

In [30]:
df.head()

Unnamed: 0,0,1,2
0,Ranking,Country/Territory,Number of cigarettes per person aged ≥ 15 per ...
1,1,Montenegro,4124.53
2,2,Belarus,3831.62
3,3,Lebanon,3023.15
4,4,Macedonia,2732.23


In [31]:
df.columns = df.iloc[0]
df = df.reindex(df.index.drop(0))

In [32]:
df.head()

Unnamed: 0,Ranking,Country/Territory,Number of cigarettes per person aged ≥ 15 per year[7]
1,1,Montenegro,4124.53
2,2,Belarus,3831.62
3,3,Lebanon,3023.15
4,4,Macedonia,2732.23
5,5,Russia,2690.33
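
The header promotion above can be tried offline on a small hand-built frame. This is a minimal sketch, assuming a frame shaped like the one `read_html` returned here (integer column labels, with the real header text in row 0). The names `raw` and `clean` and the shortened third column name are illustrative only. `pd.read_html` also accepts a `header=0` argument, which may make this step unnecessary, depending on the page markup.

```python
import pandas as pd

# Stand-in for the frame returned by read_html: the real header sits in row 0.
raw = pd.DataFrame([
    ["Ranking", "Country/Territory", "Cigarettes"],
    ["1", "Montenegro", "4124.53"],
    ["2", "Belarus", "3831.62"],
    ["3", "Lebanon", "3023.15"],
])

clean = raw.copy()
clean.columns = clean.iloc[0]   # promote row 0 to column labels
clean = clean.drop(index=0)     # drop the now-redundant header row
clean.columns.name = None       # remove the leftover axis label inherited from row 0

print(clean)
```

Dropping row 0 keeps the original index, so the first data row is labelled 1, which matches the ranking shown in the notebook output.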


Let's check the data structure.

In [33]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 182 entries, 1 to 182
Data columns (total 3 columns):
Ranking                                                  182 non-null object
Country/Territory                                        182 non-null object
Number of cigarettes per person aged ≥ 15 per year[7]    182 non-null object
dtypes: object(3)
memory usage: 5.7+ KB


In [34]:
df.shape

(182, 3)

Let's find the ranking position of our country.

In [35]:
df[df['Country/Territory'] == 'Portugal']

Unnamed: 0,Ranking,Country/Territory,Number of cigarettes per person aged ≥ 15 per year[7]
55,55,Portugal,1114.11
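
The lookup above relies on boolean masking: the comparison produces one True/False value per row, and indexing with that mask keeps only the True rows. The sketch below shows the intermediate mask on a three-row stand-in frame (the `sample`, `mask` and `position` names are illustrative).

```python
import pandas as pd

# Three-row stand-in for the full table, indexed by ranking as in the notebook.
sample = pd.DataFrame(
    {
        "Ranking": [1, 2, 55],
        "Country/Territory": ["Montenegro", "Belarus", "Portugal"],
    },
    index=[1, 2, 55],
)

# Element-wise comparison yields a boolean Series aligned with the rows.
mask = sample["Country/Territory"] == "Portugal"

# Indexing with the mask keeps only the rows where it is True.
row = sample[mask]
position = row["Ranking"].iloc[0]

print(mask.tolist())
print(position)
```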


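Let's check the descriptive statistics. Every column is still stored as text (dtype object), so describe() returns count, unique, top and freq rather than means and quartiles.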

In [36]:
df.describe()

Unnamed: 0,Ranking,Country/Territory,Number of cigarettes per person aged ≥ 15 per year[7]
count,182,182,182.0
unique,182,182,182.0
top,63,India,1619.82
freq,1,1,1.0
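
The count/unique/top/freq rows above are what `describe()` returns for non-numeric data. A minimal sketch on three values taken from the table shows how the summary changes once the same column is numeric (the `sample`, `as_text` and `as_number` names are illustrative).

```python
import pandas as pd

sample = pd.DataFrame({"Nr_cigar_pp": ["4124.53", "3831.62", "3023.15"]})

# The values are strings, so describe() reports categorical-style statistics.
as_text = sample["Nr_cigar_pp"].describe()

# After conversion to float, describe() reports mean, std and quartiles.
as_number = sample["Nr_cigar_pp"].astype(float).describe()

print(list(as_text.index))
print(list(as_number.index))
```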


Let's rename the columns to prepare the data for a correlation analysis and for mapping.

In [37]:
df.columns = ['Ranking', 'Country', 'Nr_cigar_pp']
df.head()

Unnamed: 0,Ranking,Country,Nr_cigar_pp
1,1,Montenegro,4124.53
2,2,Belarus,3831.62
3,3,Lebanon,3023.15
4,4,Macedonia,2732.23
5,5,Russia,2690.33


We need the "Number of cigarettes per person aged ≥ 15 per year[7]" column (Nr_cigar_pp) in numeric format. Let us convert it and, while doing so, turn any invalid values into NaN, which stands for Not a Number.

In [38]:
df['Nr_cigar_pp'] = pd.to_numeric(df['Nr_cigar_pp'], errors='coerce')
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 182 entries, 1 to 182
Data columns (total 3 columns):
Ranking        182 non-null object
Country        182 non-null object
Nr_cigar_pp    182 non-null float64
dtypes: float64(1), object(2)
memory usage: 5.7+ KB
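
The difference between a strict cast and a coercing conversion is easiest to see on a tiny series. In this sketch the `"n/a"` entry is a made-up stand-in for a malformed cell. The table used in this exercise showed no such values, since all 182 rows are non-null after conversion.

```python
import pandas as pd

# Hypothetical values: "n/a" stands in for a cell that is not a valid number.
values = pd.Series(["4124.53", "3831.62", "n/a"])

# errors="coerce" turns unparseable entries into NaN instead of raising.
coerced = pd.to_numeric(values, errors="coerce")

# astype(float) is stricter: it raises ValueError on the same input.
try:
    values.astype(float)
    strict_failed = False
except ValueError:
    strict_failed = True

print(coerced)
print("astype(float) raised:", strict_failed)
```

Counting `coerced.isna().sum()` after such a conversion is a quick way to see how many cells could not be parsed.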


Repeat for the "Ranking" column

In [41]:
df['Ranking'] = df['Ranking'].astype(int)

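Let's calculate the Spearman rank correlation between the ranking and the consumption figures. Because the ranking is derived from consumption (rank 1 is the highest value), a perfect negative correlation is expected.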

In [42]:
df.drop(['Country'], axis=1).corr(method='spearman')

Unnamed: 0,Ranking,Nr_cigar_pp
Ranking,1.0,-1.0
Nr_cigar_pp,-1.0,1.0
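
A coefficient of exactly -1 follows from how the table is built: the ranking orders countries by consumption, so the two columns are perfectly and inversely monotonic. The sketch below reproduces this on the first five rows shown earlier and also computes the same quantity by hand as the Pearson correlation of the ranks. The `sample`, `spearman` and `by_hand` names are illustrative.

```python
import pandas as pd

# First five rows of the table, as shown earlier in this exercise.
sample = pd.DataFrame({
    "Ranking": [1, 2, 3, 4, 5],
    "Nr_cigar_pp": [4124.53, 3831.62, 3023.15, 2732.23, 2690.33],
})

# Spearman correlation as computed by pandas.
spearman = sample.corr(method="spearman").loc["Ranking", "Nr_cigar_pp"]

# The same quantity by hand: Pearson correlation of the per-column ranks.
by_hand = sample.rank().corr(method="pearson").loc["Ranking", "Nr_cigar_pp"]

print(round(spearman, 6), round(by_hand, 6))
```

The result describes how the ranking was constructed, not an independent relationship between two measured variables.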


## Plot as a map

Let us connect to our GIS to geocode this data and present it as a map

In [43]:
from arcgis.gis import GIS
import json

gis = GIS("https://www.arcgis.com", "username", "password")

In [46]:
dat = gis.content.import_data(df, {"CountryCode":"Country"})

In [47]:
map1 = gis.map('Portugal')
map1

Let us use smart mapping to render the points with varying sizes representing the number of cigarettes per person aged ≥ 15 per year

In [50]:
map1.add_layer(dat, {"renderer": "ClassedSizeRenderer", "field_name": "Nr_cigar_pp"})

Let us publish this layer as a feature collection item in our GIS

In [52]:
item_properties = {
    "title": "Worldwide Number of cigarettes per person aged ≥ 15 per year",
    "tags": "cigarettes, aged ≥ 15",
    "snippet": "Worldwide Number of cigarettes per person aged ≥ 15 per year",
    "description": "test description",
    "text": json.dumps({"featureCollection": {"layers": [dict(dat.layer)]}}),
    "type": "Feature Collection",
    "typeKeywords": "Data, Feature Collection, Singlelayer",
    "extent" : "-102.5272,-41.7886,172.5967,64.984"
}

item = gis.content.add(item_properties)
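
The `text` property is the least obvious part of the dictionary above: the layer is wrapped in a `featureCollection` object and serialized to a JSON string. The following sketch builds the same structure without connecting to a GIS, so it can be inspected offline. `build_item_properties` is a hypothetical helper, and `placeholder_layer` is a made-up stand-in for `dict(dat.layer)`; neither is part of the ArcGIS API.

```python
import json


def build_item_properties(title, tags, layer_dict, extent):
    """Hypothetical helper: assemble the dict that the notebook passes to gis.content.add()."""
    return {
        "title": title,
        "tags": tags,
        "snippet": title,
        "description": "test description",
        # The feature collection travels as a JSON string under "text".
        "text": json.dumps({"featureCollection": {"layers": [layer_dict]}}),
        "type": "Feature Collection",
        "typeKeywords": "Data, Feature Collection, Singlelayer",
        "extent": extent,
    }


# Placeholder layer (assumption); in the notebook this would be dict(dat.layer).
placeholder_layer = {
    "layerDefinition": {"name": "cigarettes"},
    "featureSet": {"features": []},
}

props = build_item_properties(
    "Worldwide Number of cigarettes per person aged ≥ 15 per year",
    "cigarettes, aged ≥ 15",
    placeholder_layer,
    "-102.5272,-41.7886,172.5967,64.984",
)

# Round-trip the "text" field to confirm it is valid JSON holding one layer.
decoded = json.loads(props["text"])
print(len(decoded["featureCollection"]["layers"]))
```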

Let us search for this item

In [53]:
gis.content.search("Worldwide Number of cigarettes per person aged ≥ 15 per year")

[<Item title:"Worldwide Number of cigarettes per person aged ≥ 15 per year" type:Feature Collection owner:alberto.seabra>]