> Cohort Number :  *33*                      
> Lecture Number : *I*                
> Author : *Jithin J Kumar*  
> Topic : **Introduction to Data, Tidy Data** <br>
> Date : 07-06-2020, Tuesday

# Types of Data
-----

The Two Main Flavors of Data: **Qualitative** and **Quantitative** <br>
At the highest level, two kinds of data exist: quantitative and qualitative.

**Quantitative data** deals with numbers and things you can measure objectively: dimensions such as height, width, and length; temperature and humidity; prices; area and volume.

**Qualitative data** deals with characteristics and descriptors that can't be easily measured, but can be observed subjectively—such as smells, tastes, textures, attractiveness, and color. 

Broadly speaking, when you measure something and give it a number value, you create quantitative data. When you classify or judge something, you create qualitative data. So far, so good. But this is just the highest level of data: there are also different types of quantitative and qualitative data.

## Quantitative Data
----

There are two types of quantitative data, also called numeric data: **continuous** and **discrete**.<br>
As a general rule, counts are discrete and measurements are continuous.

**Discrete data** is a count that can't be made more precise. Typically it involves integers. For instance, the number of children (or adults, or pets) in your family is discrete data, because you are counting whole, indivisible entities: you can't have 2.5 kids, or 1.3 pets.

**Continuous data**, on the other hand, could be divided and reduced to finer and finer levels. For example, you can measure the height of your kids at progressively more precise scales—meters, centimeters, millimeters, and beyond—so height is continuous data.

If I tally the number of individual jellies in a box, that number is a piece of discrete data.
If I use a scale to measure the weight of each jelly, or the weight of the entire box, that's continuous data.

![jellybeans](https://blog.minitab.com/hubfs/Imported_Blog_Media/jujubes_count_tally.jpg)

![jelly_weight](https://blog.minitab.com/hubfs/Imported_Blog_Media/jujube_weight_continuous_data.jpg)
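
The count-versus-measurement distinction can be sketched in pandas. The weights below are made-up values and the column names (`piece`, `weight_g`) are hypothetical; the point is only that a count is a whole number while a measurement is a float that a better scale could refine.

```python
import pandas as pd

# Hypothetical measurements for one box of jellies (made-up values).
box = pd.DataFrame({
    "piece": [1, 2, 3, 4],
    "weight_g": [2.31, 2.48, 2.40, 2.27],
})

piece_count = len(box)                # discrete: a whole-number count
total_weight = box["weight_g"].sum()  # continuous: a measurement on a scale

print(piece_count, round(total_weight, 2))
print(box.dtypes)  # integer column for the tally, float column for the weights
```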

## Qualitative Data
-----

When you classify or categorize something, you create qualitative, or attribute, data. There are three main kinds of qualitative data.

**Binary data** places things in one of two mutually exclusive categories: right/wrong, true/false, or accept/reject.

Occasionally, I'll get a box of jellies that contains a couple of individual pieces that are either too hard or too dry. If I went through the box and classified each piece as "Good" or "Bad," that would be binary data. I could use this kind of data to develop a statistical model to predict how frequently I can expect to get a bad jelly.

When collecting **unordered or nominal data**, we assign individual items to named categories that do not have an implicit or natural value or rank. If I went through a box of jellies and recorded the color of each in my worksheet, that would be **nominal data**.

We can also have **ordered or ordinal data**, in which items are assigned to categories that do have some kind of implicit or natural order, such as "Short, Medium, or Tall." Another example is a survey question that asks us to rate an item on a 1 to 10 scale, with 10 being the best. This implies that 10 is better than 9, which is better than 8, and so on.
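
One way to represent these three kinds of qualitative data in pandas is sketched below. The frame and its column names (`is_good`, `color`, `size`) are hypothetical and the values are made up; the sketch uses a boolean column for binary data, an unordered category for nominal data, and an ordered category for ordinal data.

```python
import pandas as pd

# Hypothetical observations (made-up values).
items = pd.DataFrame({
    "is_good": [True, False, True, True],          # binary
    "color": ["red", "green", "red", "yellow"],    # nominal
    "size": ["Short", "Tall", "Medium", "Short"],  # ordinal
})

# Nominal: named categories with no rank.
items["color"] = items["color"].astype("category")

# Ordinal: the categories carry an explicit order.
size_type = pd.CategoricalDtype(["Short", "Medium", "Tall"], ordered=True)
items["size"] = items["size"].astype(size_type)

# Order-based comparisons only make sense for the ordinal column.
bigger_than_short = items["size"] > "Short"
print(items["size"].max())
print(bigger_than_short.tolist())
```

Declaring the order up front means sorting and comparisons follow "Short < Medium < Tall" rather than alphabetical order.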

# Tidy Data
---

> Read the complete paper: <br>
[Tidy Data by Hadley Wickham](https://vita.had.co.nz/papers/tidy-data.pdf)

1) Analyse the *'gapminder'* data using pandas and explain the insights you can get from it. <br>
2) Clean the *'pew'* dataset and reshape it into long format as per tidy data principles. <br>
3) Clean the *'billboard'* dataset as per tidy data principles. <br>
4) Clean the *'country_timeseries'* dataset (ebola dataset) as per tidy data principles. <br>
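
As a rough illustration of what "long format" means for a table like the pew data, here is a minimal sketch on a made-up frame. The column names and counts are assumptions for illustration only, not the real dataset; the idea is that income brackets stored as column headers become values of a single `income` variable.

```python
import pandas as pd

# Made-up counts in a wide layout: one column per income bracket.
pew_wide = pd.DataFrame({
    "religion": ["Agnostic", "Atheist"],
    "<$10k": [10, 30],
    "$10-20k": [20, 40],
})

# melt() turns the bracket columns into (income, respondents) pairs,
# giving one row per religion and income bracket.
pew_long = pew_wide.melt(
    id_vars="religion",
    var_name="income",
    value_name="respondents",
)
print(pew_long)
```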

## Answers

1) Analysing Gapminder Dataset.

In [1]:
import pandas as pd

In [10]:
df = pd.read_csv('./assignments/gapminder.tsv', sep='\t')

In [7]:
df.describe()

| | year | lifeExp | pop | gdpPercap |
| --- | --- | --- | --- | --- |
| count | 1704.0 | 1704.0 | 1704.0 | 1704.0 |
| mean | 1979.5 | 59.474439 | 29601210.0 | 7215.327081 |
| std | 17.26533 | 12.917107 | 106157900.0 | 9857.454543 |
| min | 1952.0 | 23.599 | 60011.0 | 241.165877 |
| 25% | 1965.75 | 48.198 | 2793664.0 | 1202.060309 |
| 50% | 1979.5 | 60.7125 | 7023596.0 | 3531.846989 |
| 75% | 1993.25 | 70.8455 | 19585220.0 | 9325.462346 |
| max | 2007.0 | 82.603 | 1318683000.0 | 113523.1329 |


In [13]:
# Per Country 
df[df['country']=='India']

| | country | continent | year | lifeExp | pop | gdpPercap |
| --- | --- | --- | --- | --- | --- | --- |
| 696 | India | Asia | 1952 | 37.373 | 372000000 | 546.565749 |
| 697 | India | Asia | 1957 | 40.249 | 409000000 | 590.061996 |
| 698 | India | Asia | 1962 | 43.605 | 454000000 | 658.347151 |
| 699 | India | Asia | 1967 | 47.193 | 506000000 | 700.770611 |
| 700 | India | Asia | 1972 | 50.651 | 567000000 | 724.032527 |
| 701 | India | Asia | 1977 | 54.208 | 634000000 | 813.337323 |
| 702 | India | Asia | 1982 | 56.596 | 708000000 | 855.723538 |
| 703 | India | Asia | 1987 | 58.553 | 788000000 | 976.512676 |
| 704 | India | Asia | 1992 | 60.223 | 872000000 | 1164.406809 |
| 705 | India | Asia | 1997 | 61.765 | 959000000 | 1458.817442 |


In [20]:
# per continent
df[df['continent']=='Americas'].describe()

| | year | lifeExp | pop | gdpPercap |
| --- | --- | --- | --- | --- |
| count | 300.0 | 300.0 | 300.0 | 300.0 |
| mean | 1979.5 | 64.658737 | 24504790.0 | 7136.110356 |
| std | 17.289102 | 9.345088 | 50979430.0 | 6396.764112 |
| min | 1952.0 | 37.579 | 662850.0 | 1201.637154 |
| 25% | 1965.75 | 58.41 | 2962359.0 | 3427.779072 |
| 50% | 1979.5 | 67.048 | 6227510.0 | 5465.509853 |
| 75% | 1993.25 | 71.6995 | 18340310.0 | 7830.210416 |
| max | 2007.0 | 80.653 | 301139900.0 | 42951.65309 |


In [19]:
df['continent'].unique()

array(['Asia', 'Europe', 'Africa', 'Americas', 'Oceania'], dtype=object)
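
Rather than filtering one continent at a time, the same kind of summary can be produced for every continent at once with `groupby`. The sketch below runs on a small made-up frame that only borrows the gapminder column names, so the numbers are not real gapminder values.

```python
import pandas as pd

# Made-up rows using the gapminder column names.
toy = pd.DataFrame({
    "continent": ["Asia", "Asia", "Europe", "Europe"],
    "lifeExp": [60.0, 70.0, 75.0, 79.0],
    "gdpPercap": [1000.0, 3000.0, 20000.0, 30000.0],
})

# One row per continent, with the mean of each numeric column.
summary = toy.groupby("continent")[["lifeExp", "gdpPercap"]].mean()
print(summary)
```

On the real data the same call would be applied to `df` instead of `toy`.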

In [97]:
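# GDP per capita for Pakistan from the second survey year onwards
gdp_new = list(df[df['country']=='Pakistan']['gdpPercap'][1:])

In [98]:
# Drop the last year (it has no following value); .copy() gives an independent
# frame, so adding columns below does not trigger SettingWithCopyWarning
temp_df = df[df['country']=='Pakistan'][:-1].copy()
temp_df['gdp_new'] = gdp_new

In [99]:
# Percentage change from each survey year to the next one
temp_df['growth']=100*(temp_df['gdp_new'] - temp_df['gdpPercap'])/temp_df['gdpPercap']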

In [104]:
def gdpPercap_growth(country):
    # Percentage growth in GDP per capita from each survey year to the next one
    gdp_new = list(df[df['country']==country]['gdpPercap'][1:])
    # Drop the last year (no following value); copy so columns can be added safely
    temp_df = df[df['country']==country][:-1].copy()
    temp_df['gdp_new'] = gdp_new
    temp_df['growth']=100*(temp_df['gdp_new'] - temp_df['gdpPercap'])/temp_df['gdpPercap']
    return temp_df[['year','growth']]

gdpPercap_growth('Nepal')

| | year | growth |
| --- | --- | --- |
| 1068 | 1952 | 9.539092 |
| 1069 | 1957 | 9.108077 |
| 1070 | 1962 | 3.685696 |
| 1071 | 1967 | -0.244529 |
| 1072 | 1972 | 2.86376 |
| 1073 | 1977 | 3.495205 |
| 1074 | 1982 | 7.970699 |
| 1075 | 1987 | 15.743012 |
| 1076 | 1992 | 12.604065 |
| 1077 | 1997 | 4.581515 |
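
A possible alternative to looping over countries is `groupby` with `pct_change`, sketched here on a made-up frame that only reuses the gapminder column names. Note one difference from the function above: `pct_change` attaches each growth figure to the later year of the period, so the first year of every country is `NaN`, whereas the function above labels growth with the earlier year.

```python
import pandas as pd

# Made-up values for two hypothetical countries.
toy = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B", "B"],
    "year": [1952, 1957, 1962, 1952, 1957, 1962],
    "gdpPercap": [100.0, 110.0, 121.0, 200.0, 150.0, 300.0],
})

# Sort so that each country's rows are in chronological order,
# then compute the percentage change within each country.
toy = toy.sort_values(["country", "year"])
toy["growth"] = toy.groupby("country")["gdpPercap"].pct_change() * 100
print(toy)
```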


In [117]:
# Sorting
df.sort_values(by=['year','lifeExp'], ascending=True)

| | country | continent | year | lifeExp | pop | gdpPercap |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.445314 |
| 552 | Gambia | Africa | 1952 | 30.000 | 284320 | 485.230659 |
| 36 | Angola | Africa | 1952 | 30.015 | 4232095 | 3520.610273 |
| 1344 | Sierra Leone | Africa | 1952 | 30.331 | 2143249 | 879.787736 |
| 1032 | Mozambique | Africa | 1952 | 31.286 | 6446316 | 468.526038 |
| ... | ... | ... | ... | ... | ... | ... |
| 71 | Australia | Oceania | 2007 | 81.235 | 20434176 | 34435.367440 |
| 1487 | Switzerland | Europe | 2007 | 81.701 | 7554661 | 37506.419070 |
| 695 | Iceland | Europe | 2007 | 81.757 | 301931 | 36180.789190 |
| 671 | Hong Kong, China | Asia | 2007 | 82.208 | 6980412 | 39724.978670 |


In [149]:
# pivot tables
temp_df = df[df['continent']=='Asia'].pivot_table(index=['continent','country'],columns='year',values='gdpPercap', aggfunc='mean')
temp_df

| continent | country | 1952 | 1957 | 1962 | 1967 | 1972 | 1977 | 1982 | 1987 | 1992 | 1997 | 2002 | 2007 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Asia | Afghanistan | 779.445314 | 820.85303 | 853.10071 | 836.197138 | 739.981106 | 786.11336 | 978.011439 | 852.395945 | 649.341395 | 635.341351 | 726.734055 | 974.580338 |
| Asia | Bahrain | 9867.084765 | 11635.79945 | 12753.27514 | 14804.6727 | 18268.65839 | 19340.10196 | 19211.14731 | 18524.02406 | 19035.57917 | 20292.01679 | 23403.55927 | 29796.04834 |
| Asia | Bangladesh | 684.244172 | 661.637458 | 686.341554 | 721.186086 | 630.233627 | 659.877232 | 676.981866 | 751.979403 | 837.810164 | 972.770035 | 1136.39043 | 1391.253792 |
| Asia | Cambodia | 368.469286 | 434.038336 | 496.913648 | 523.432314 | 421.624026 | 524.972183 | 624.475478 | 683.895573 | 682.303175 | 734.28517 | 896.226015 | 1713.778686 |
| Asia | China | 400.448611 | 575.987001 | 487.674018 | 612.705693 | 676.900092 | 741.23747 | 962.42138 | 1378.904018 | 1655.784158 | 2289.234136 | 3119.280896 | 4959.114854 |
| Asia | Hong Kong, China | 3054.421209 | 3629.076457 | 4692.648272 | 6197.962814 | 8315.928145 | 11186.14125 | 14560.53051 | 20038.47269 | 24757.60301 | 28377.63219 | 30209.01516 | 39724.97867 |
| Asia | India | 546.565749 | 590.061996 | 658.347151 | 700.770611 | 724.032527 | 813.337323 | 855.723538 | 976.512676 | 1164.406809 | 1458.817442 | 1746.769454 | 2452.210407 |
| Asia | Indonesia | 749.681655 | 858.900271 | 849.28977 | 762.431772 | 1111.107907 | 1382.702056 | 1516.872988 | 1748.356961 | 2383.140898 | 3119.335603 | 2873.91287 | 3540.651564 |
| Asia | Iran | 3035.326002 | 3290.257643 | 4187.329802 | 5906.731805 | 9613.818607 | 11888.59508 | 7608.334602 | 6642.881371 | 7235.653188 | 8263.590301 | 9240.761975 | 11605.71449 |
| Asia | Iraq | 4129.766056 | 6229.333562 | 8341.737815 | 8931.459811 | 9576.037596 | 14688.23507 | 14517.90711 | 11643.57268 | 3745.640687 | 3076.239795 | 4390.717312 | 4471.061906 |


In [150]:
temp_df.reset_index(inplace=True)

In [152]:
temp_df.head(10)

| | continent | country | 1952 | 1957 | 1962 | 1967 | 1972 | 1977 | 1982 | 1987 | 1992 | 1997 | 2002 | 2007 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Asia | Afghanistan | 779.445314 | 820.85303 | 853.10071 | 836.197138 | 739.981106 | 786.11336 | 978.011439 | 852.395945 | 649.341395 | 635.341351 | 726.734055 | 974.580338 |
| 1 | Asia | Bahrain | 9867.084765 | 11635.79945 | 12753.27514 | 14804.6727 | 18268.65839 | 19340.10196 | 19211.14731 | 18524.02406 | 19035.57917 | 20292.01679 | 23403.55927 | 29796.04834 |
| 2 | Asia | Bangladesh | 684.244172 | 661.637458 | 686.341554 | 721.186086 | 630.233627 | 659.877232 | 676.981866 | 751.979403 | 837.810164 | 972.770035 | 1136.39043 | 1391.253792 |
| 3 | Asia | Cambodia | 368.469286 | 434.038336 | 496.913648 | 523.432314 | 421.624026 | 524.972183 | 624.475478 | 683.895573 | 682.303175 | 734.28517 | 896.226015 | 1713.778686 |
| 4 | Asia | China | 400.448611 | 575.987001 | 487.674018 | 612.705693 | 676.900092 | 741.23747 | 962.42138 | 1378.904018 | 1655.784158 | 2289.234136 | 3119.280896 | 4959.114854 |
| 5 | Asia | Hong Kong, China | 3054.421209 | 3629.076457 | 4692.648272 | 6197.962814 | 8315.928145 | 11186.14125 | 14560.53051 | 20038.47269 | 24757.60301 | 28377.63219 | 30209.01516 | 39724.97867 |
| 6 | Asia | India | 546.565749 | 590.061996 | 658.347151 | 700.770611 | 724.032527 | 813.337323 | 855.723538 | 976.512676 | 1164.406809 | 1458.817442 | 1746.769454 | 2452.210407 |
| 7 | Asia | Indonesia | 749.681655 | 858.900271 | 849.28977 | 762.431772 | 1111.107907 | 1382.702056 | 1516.872988 | 1748.356961 | 2383.140898 | 3119.335603 | 2873.91287 | 3540.651564 |
| 8 | Asia | Iran | 3035.326002 | 3290.257643 | 4187.329802 | 5906.731805 | 9613.818607 | 11888.59508 | 7608.334602 | 6642.881371 | 7235.653188 | 8263.590301 | 9240.761975 | 11605.71449 |
| 9 | Asia | Iraq | 4129.766056 | 6229.333562 | 8341.737815 | 8931.459811 | 9576.037596 | 14688.23507 | 14517.90711 | 11643.57268 | 3745.640687 | 3076.239795 | 4390.717312 | 4471.061906 |


In [154]:
temp_df.melt(id_vars=['continent','country'])

| | continent | country | year | value |
| --- | --- | --- | --- | --- |
| 0 | Asia | Afghanistan | 1952 | 779.445314 |
| 1 | Asia | Bahrain | 1952 | 9867.084765 |
| 2 | Asia | Bangladesh | 1952 | 684.244172 |
| 3 | Asia | Cambodia | 1952 | 368.469286 |
| 4 | Asia | China | 1952 | 400.448611 |
| ... | ... | ... | ... | ... |
| 391 | Asia | Taiwan | 2007 | 28718.276840 |
| 392 | Asia | Thailand | 2007 | 7458.396327 |
| 393 | Asia | Vietnam | 2007 | 2441.576404 |
| 394 | Asia | West Bank and Gaza | 2007 | 3025.349798 |
