# Module 4: Exploratory Data Analysis



In [None]:
import pandas as pd
import numpy as np

## Integrating data science life cycle concepts in EDA, link Data 8 skills to EDA.

## Finding Data

Being a data scientist means that we're going to be dealing with a lot of data, but how exactly do we _get_ 
this data? Sometimes you collect the data yourself through observation, or the data is given
to you, as in Data 8. But what if you're starting a new project and have no data to begin with? In that
case, you have to find the data yourself. Where would you even begin?
Thankfully, finding data is a lot easier than you might think!

One of the first resources we recommend is [Kaggle](https://www.kaggle.com/datasets), which is
a large central repository of datasets. You can find datasets such as [COVID-19](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge)
and [MNIST](https://www.kaggle.com/c/digit-recognizer) for more traditional projects, or
even fun ones like [Pokemon statistics](https://www.kaggle.com/abcsds/pokemon) and the [Titanic](https://www.kaggle.com/c/titanic)
dataset.

Another way to search for datasets is to just [Google](https://www.google.com/) for datasets that you want!
Google even has a [dataset search](https://datasetsearch.research.google.com/) feature built by Google's
research department that can help narrow down results to actual datasets rather than the full collection of
human knowledge that Google provides.

Finally, there are domain-specific dataset sources that you can use. For example, for genome data, you can use
[NCBI](https://www.ncbi.nlm.nih.gov/genome/). There's also Reddit's [r/datasets](https://www.reddit.com/r/datasets/),
where you can look for further sources and domain-specific knowledge.

### Reading in data files with pandas


The data that you find online can come in several different formats, the most common of which is CSV. 
CSV files are just comma-separated data files, and pandas has a specific function for reading them:

`df = pd.read_csv(file_path)`

Similar to CSV files are TSV files. TSV files have the same structure as CSV files, but the values are separated by tabs. 
To read TSV files in pandas, we again use the `read_csv` function:

`pd.read_csv(file_path, sep='\t')`

The `sep='\t'` argument signifies that the file is separated by tabs instead of the default `,`.
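As a minimal sketch of the two calls above, the example below reads the same small table once as CSV and once as TSV. The data is made up for illustration and held in memory with `io.StringIO`, which stands in for a file on disk.

```python
import io
import pandas as pd

# Hypothetical in-memory data standing in for files on disk.
csv_text = "country,population\nAlbania,3.16\nAngola,20.82\n"
tsv_text = csv_text.replace(",", "\t")

# read_csv accepts a path or any file-like object.
csv_df = pd.read_csv(io.StringIO(csv_text))
tsv_df = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# Both calls should yield the same table.
print(csv_df.equals(tsv_df))
```

The only difference between the two reads is the `sep` argument.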

Some other file types you might run into in your search for data are xlsx files, json files, and html files. 
Luckily, pandas has a dedicated reader function for each of these formats.

For xlsx files, use:
`df = pd.read_excel(file_path)`

For json files, use:
`df = pd.read_json(file_path)`

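For html files, use:
`dfs = pd.read_html(file_path)`

Note that `read_html` returns a list of DataFrames, one for each table found in the page, rather than a single DataFrame.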
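Here is a small sketch of the JSON reader, using a made-up list of records held in memory. The Excel and HTML readers are left out because they rely on optional packages (for example `openpyxl` for xlsx files and an HTML parser such as `lxml`) that may not be installed in your environment.

```python
import io
import pandas as pd

# Hypothetical JSON payload: a list of records, one per row.
json_text = '[{"country": "Albania", "hdi": 0.73}, {"country": "Angola", "hdi": 0.52}]'

# Wrapping the text in StringIO makes it look like an open file.
records = pd.read_json(io.StringIO(json_text))
print(records)
```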

## Exploratory Data Analysis (EDA)

The first step of EDA is determining the "shape" of our data. 
In general, rectangular structures tend to be easier to manipulate and analyze, and data cleaning often involves getting our data into a rectangular shape.



In this section, we will be working with the [2016 Global Ecological Footprint](https://www.kaggle.com/footprintnetwork/ecological-footprint) dataset from Kaggle. 
The following description can be found on the website:

*The ecological footprint measures the ecological assets that a given population requires to produce the natural resources it consumes (including plant-based food and fiber products, livestock and fish products, timber and other forest products, space for urban infrastructure) and to absorb its waste, especially carbon emissions. The footprint tracks the use of six categories of productive surface areas: cropland, grazing land, fishing grounds, built-up (or urban) land, forest area, and carbon demand on land.*

*A nation’s biocapacity represents the productivity of its ecological assets, including cropland, grazing land, forest land, fishing grounds, and built-up land. These areas, especially if left unharvested, can also absorb much of the waste we generate, especially our carbon emissions.*

*Both the ecological footprint and biocapacity are expressed in global hectares — globally comparable, standardized hectares with world average productivity.*

*If a population’s ecological footprint exceeds the region’s biocapacity, that region runs an ecological deficit. Its demand for the goods and services that its land and seas can provide — fruits and vegetables, meat, fish, wood, cotton for clothing, and carbon dioxide absorption — exceeds what the region’s ecosystems can renew. A region in ecological deficit meets demand by importing, liquidating its own ecological assets (such as overfishing), and/or emitting carbon dioxide into the atmosphere. If a region’s biocapacity exceeds its ecological footprint, it has an ecological reserve.*

In [None]:
countries = pd.read_csv('countries.csv')
countries

| | Country | Region | Population (millions) | HDI | GDP per Capita | Cropland Footprint | Grazing Footprint | Forest Footprint | Carbon Footprint | Fish Footprint | ... | Cropland | Grazing Land | Forest Land | Fishing Water | Urban Land | Total Biocapacity | Biocapacity Deficit or Reserve | Earths Required | Countries Required | Data Quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | Middle East/Central Asia | 29.82 | 0.46 | \$614.66 | 0.30 | 0.20 | 0.08 | 0.18 | 0.00 | ... | 0.24 | 0.20 | 0.02 | 0.00 | 0.04 | 0.50 | -0.30 | 0.46 | 1.60 | 6 |
| 1 | Albania | Northern/Eastern Europe | 3.16 | 0.73 | \$4,534.37 | 0.78 | 0.22 | 0.25 | 0.87 | 0.02 | ... | 0.55 | 0.21 | 0.29 | 0.07 | 0.06 | 1.18 | -1.03 | 1.27 | 1.87 | 6 |
| 2 | Algeria | Africa | 38.48 | 0.73 | \$5,430.57 | 0.60 | 0.16 | 0.17 | 1.14 | 0.01 | ... | 0.24 | 0.27 | 0.03 | 0.01 | 0.03 | 0.59 | -1.53 | 1.22 | 3.61 | 5 |
| 3 | Angola | Africa | 20.82 | 0.52 | \$4,665.91 | 0.33 | 0.15 | 0.12 | 0.20 | 0.09 | ... | 0.20 | 1.42 | 0.64 | 0.26 | 0.04 | 2.55 | 1.61 | 0.54 | 0.37 | 6 |
| 4 | Antigua and Barbuda | Latin America | 0.09 | 0.78 | \$13,205.10 | | | | | | ... | | | | | | 0.94 | -4.44 | 3.11 | 5.70 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 183 | Viet Nam | Asia-Pacific | 90.80 | 0.66 | \$1,532.31 | 0.50 | 0.01 | 0.19 | 0.79 | 0.05 | ... | 0.55 | 0.01 | 0.17 | 0.16 | 0.10 | 1.00 | -0.65 | 0.95 | 1.66 | 6 |
| 184 | Wallis and Futuna Islands | Asia-Pacific | 0.01 | | | | | | | | ... | | | | | | 1.51 | -0.56 | 1.19 | 1.37 | 3T |
| 185 | Yemen | Middle East/Central Asia | 23.85 | 0.50 | \$1,302.30 | 0.34 | 0.14 | 0.04 | 0.42 | 0.04 | ... | 0.09 | 0.12 | 0.04 | 0.20 | 0.04 | 0.50 | -0.53 | 0.59 | 2.06 | 5 |
| 186 | Zambia | Africa | 14.08 | 0.58 | \$1,740.64 | 0.19 | 0.18 | 0.33 | 0.24 | 0.01 | ... | 0.24 | 0.94 | 0.99 | 0.02 | 0.04 | 2.23 | 1.24 | 0.57 | 0.44 | 6 |


In [None]:
countries.shape

(188, 21)

We see that the `countries` dataset has `188` rows and `21` columns, so our dataset is already rectangular. Next, let's look at the columns and their data types.

We can see the type of each column by accessing the `df.dtypes` attribute.

In [None]:
countries.dtypes

Country                            object
Region                             object
Population (millions)             float64
HDI                               float64
GDP per Capita                     object
Cropland Footprint                float64
Grazing Footprint                 float64
Forest Footprint                  float64
Carbon Footprint                  float64
Fish Footprint                    float64
Total Ecological Footprint        float64
Cropland                          float64
Grazing Land                      float64
Forest Land                       float64
Fishing Water                     float64
Urban Land                        float64
Total Biocapacity                 float64
Biocapacity Deficit or Reserve    float64
Earths Required                   float64
Countries Required                float64
Data Quality                       object
dtype: object

We can see that most of the data types are what we expect them to be. Categorical variables like `Region` are represented as `object` in pandas, a general-purpose type that in this case holds strings.

We also see that the `GDP per Capita` column is of type `object`, but it should be a numeric variable because it is a value that we can calculate with, sort, and apply numeric functions to. 

Let's look at why it is an `object` and not a numeric type.


In [None]:
countries['GDP per Capita']

0         $614.66
1       $4,534.37
2       $5,430.57
3       $4,665.91
4      $13,205.10
          ...    
183     $1,532.31
184           NaN
185     $1,302.30
186     $1,740.64
187       $865.91
Name: GDP per Capita, Length: 188, dtype: object

Interestingly, we see that `GDP per Capita` is represented as a string which goes against our intuition that GDP is a numeric quantity. 

In [None]:
countries['GDP per Capita'].head()

0       $614.66
1     $4,534.37
2     $5,430.57
3     $4,665.91
4    $13,205.10
Name: GDP per Capita, dtype: object

We see that the values contain the `$` symbol and the `,` symbol. We need to get rid of these characters and convert the data to a numeric type such as `float`.

Pandas has a method to replace patterns in strings. The `.str.replace(pat, repl)` method searches for all instances of `pat` in each string of a series and replaces them with `repl`. Passing `regex=False` makes pandas treat `pat` as literal text, which matters here because `$` is a special character in regular expressions.

The expression below looks for all instances of `,` and `$` and replaces them with the empty string.

In [None]:
countries['GDP per Capita'] = countries['GDP per Capita'].str.replace(',', '', regex=False).str.replace('$', '', regex=False)
countries['GDP per Capita']

0        614.66
1       4534.37
2       5430.57
3       4665.91
4      13205.10
         ...   
183     1532.31
184         NaN
185     1302.30
186     1740.64
187      865.91
Name: GDP per Capita, Length: 188, dtype: object

Despite removing the problematic symbols, our series is still of type `object`. 

To convert it to a float type, we can use the method `series.astype(type)`.

In [None]:
countries['GDP per Capita'] = countries['GDP per Capita'].astype(float)
countries['GDP per Capita']

0        614.66
1       4534.37
2       5430.57
3       4665.91
4      13205.10
         ...   
183     1532.31
184         NaN
185     1302.30
186     1740.64
187      865.91
Name: GDP per Capita, Length: 188, dtype: float64
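The two cleaning steps can also be wrapped into one reusable helper. The sketch below is one possible way to do it, not the only one: the function name `currency_to_float` and the sample values are made up for illustration, and it uses `pd.to_numeric` with `errors="coerce"` so that unexpected entries become `NaN` instead of raising an error.

```python
import pandas as pd

def currency_to_float(series):
    """Strip '$' and ',' from a string Series and convert it to floats."""
    cleaned = (
        series.str.replace(",", "", regex=False)
              .str.replace("$", "", regex=False)
    )
    # errors="coerce" turns anything unparseable into NaN instead of raising.
    return pd.to_numeric(cleaned, errors="coerce")

# Hypothetical sample mirroring the formats seen in the column above.
sample = pd.Series(["$614.66", "$4,534.37", None, "n/a"])
converted = currency_to_float(sample)
print(converted)
```

Coercing is convenient, but it can hide bad entries, so it is worth checking how many `NaN` values appear after the conversion.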

In [None]:
countries


                       Country                    Region  \
0                  Afghanistan  Middle East/Central Asia   
1                      Albania   Northern/Eastern Europe   
2                      Algeria                    Africa   
3                       Angola                    Africa   
4          Antigua and Barbuda             Latin America   
..                         ...                       ...   
183                   Viet Nam              Asia-Pacific   
184  Wallis and Futuna Islands              Asia-Pacific   
185                      Yemen  Middle East/Central Asia   
186                     Zambia                    Africa   
187                   Zimbabwe                    Africa   

     Population (millions)   HDI GDP per Capita  Cropland Footprint  \
0                    29.82  0.46        $614.66                0.30   
1                     3.16  0.73      $4,534.37                0.78   
2                    38.48  0.73      $5,430.57                0.6

## Exploratory Data Analysis: Granularity

*Granularity* refers to the scale or level of detail present in a dataset.
For example, in the `countries` dataframe, the smallest "piece" of information that we are concerned with is at the country level. 

Other levels that we may encounter include data at the continental level, the province/state level, or even the city/zip code level.
These are all different levels of granularity that may appear in data. The country level is fine enough for a world
view, but if we are concerned with regional changes within a country, the country scale would be too coarse.

The granularity of your data helps determine whether or not a dataset is a good fit for your project.

In the case of the `countries` dataset, we can move our level of granularity up to the region level
by grouping by region and summing our numerical fields. Summing is meaningful for totals such as population, but not for indexes such as `HDI`, so read the summed columns with care. Notice how we can move *up* in granularity but
we typically cannot move *down*. Our finest level of granularity cannot be made any finer
unless we have new data to augment what we already have.

In [None]:
countries.groupby("Region").sum(numeric_only=True)  # Coarsen granularity from the country level to the region level


                          Population (millions)        HDI  \
Region                                                       
Africa                                 1034.640  25.270000   
Asia-Pacific                           3880.170  19.250000   
European Union                          503.980  22.480000   
Latin America                           605.410  23.053846   
Middle East/Central Asia                405.586  16.720000   
North America                           352.400   1.820000   
Northern/Eastern Europe                 238.180   9.460000   

                          Cropland Footprint  Grazing Footprint  \
Region                                                            
Africa                                 19.48              11.48   
Asia-Pacific                           17.96               7.97   
European Union                         23.59               6.25   
Latin America                          14.85              12.57   
Middle East/Central Asia               
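When moving up in granularity, different columns often call for different aggregations. The sketch below is an illustration under assumptions: it uses a three-row miniature of the table (values copied from the rows shown earlier) and made-up output column names, and it sums population while averaging `HDI`. A simple mean treats every country equally, so a population-weighted mean may be more appropriate depending on the question.

```python
import pandas as pd

# Hypothetical miniature of the country-level table.
mini = pd.DataFrame({
    "Country": ["Algeria", "Angola", "Albania"],
    "Region": ["Africa", "Africa", "Northern/Eastern Europe"],
    "Population (millions)": [38.48, 20.82, 3.16],
    "HDI": [0.73, 0.52, 0.73],
})

# Named aggregation: output column = (input column, aggregation).
by_region = mini.groupby("Region").agg(
    population=("Population (millions)", "sum"),
    mean_hdi=("HDI", "mean"),
    n_countries=("Country", "count"),
)
print(by_region)
```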

## EDA: Scope

*Scope*, on the other hand, is concerned with what *kind* of data your dataset contains. For example, the `countries`
dataframe has data on ecological footprint and GDP, which is useful for a project on ecological impact
by country, since this is our dataset's specialty, or scope. However, if we were doing a project on the impact
countries have on the global economy, the `countries` dataframe would be ill-suited to the task. We would
instead want a dataset whose scope is centered on, for example, the financial institutions of each country.

## EDA: Temporality

*Temporality* refers to how the data is situated with regard to time. This can involve the time/date the data was collected or entered into the database. It can also be important to consider location in some cases, as different time zones can get tricky.

One important thing to note is that this dataset contains values from 2016. Depending on your need, you may want to search for more current data. 
If that is not available, it is a good practice to note in your research why you had to work with 2016 data.

While exploring any dataset, one thing we need to keep in mind is when the data was collected. 

If the dataset is not up to date, then:

1. Are our findings still valid/useful? 
2. What changes since the data was collected could affect the findings of our research?
3. Is there a way we can account for those changes and their impact in our own research?
4. Was it collected in the same time frame we wish to study?

Temporality is always something we should keep in mind while doing research, no matter what patterns the data discloses. 
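The `countries` dataset has no timestamp column, so the sketch below uses a made-up table with an assumed `collected_at` column to illustrate the time zone issue mentioned above. It converts timestamps recorded in different local time zones to UTC and then checks the time span the data covers.

```python
import pandas as pd

# Hypothetical records; 'collected_at' is an assumed column name.
log = pd.DataFrame({
    "collected_at": ["2016-03-01 09:00:00+01:00", "2016-03-01 23:30:00-08:00"],
    "value": [1.2, 3.4],
})

# Normalize every timestamp to UTC so mixed time zones become comparable.
log["collected_at"] = pd.to_datetime(log["collected_at"], utc=True)

earliest = log["collected_at"].min()
latest = log["collected_at"].max()
print(earliest, latest)
```

Both records carry the local date March 1, but in UTC the second one falls on March 2, which is the kind of shift that can matter when grouping by day.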

## EDA: Faithfulness

*Faithfulness* has to do with the reliability and accuracy of the data at hand. This can include errors as small as typos or shifted fields, or go as far as data falsification and incorrect values.

Some things to consider when gauging the reliability of the data:
- Does my data contain unrealistic or “incorrect” values?
    - Dates in the future for events in the past
    - Locations that don’t exist
    - Negative counts
- Was the data entered by hand?
    - Spelling errors, fields shifted …
    - Did the form require fields or provide default values?
- Does my data violate obvious dependencies?
    - E.g., age and birthday don’t match
- Are there obvious signs of data falsification?
    - Repeated names, fake looking email addresses, repeated use of uncommon names or fields.
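Several of the checks in the list above can be expressed as boolean masks. The sketch below is only an illustration: the `visits` table, its column names, and the `as_of` cutoff date are all made up, and real datasets will need checks tailored to their own fields.

```python
import pandas as pd

# Hypothetical hand-entered records with deliberate problems.
visits = pd.DataFrame({
    "visit_date": pd.to_datetime(["2015-06-01", "2031-01-15", "2016-02-10"]),
    "visitor_count": [12, 7, -3],
    "email": ["a@example.com", "a@example.com", "b@example.com"],
})

# Assumed date the dataset was published; later dates are suspicious.
as_of = pd.Timestamp("2016-12-31")

# One boolean column per check, one row per record.
problems = pd.DataFrame({
    "future_date": visits["visit_date"] > as_of,
    "negative_count": visits["visitor_count"] < 0,
    "repeated_email": visits["email"].duplicated(keep=False),
})
print(problems.sum())
```

A flagged row is not necessarily wrong (repeated values can be legitimate), so these masks are a starting point for a closer look rather than a verdict.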

Let's explore the `Data Quality` column of the dataset. First, we will look at which values are in the column and how often each occurs.

In [None]:
countries["Data Quality"].value_counts()

5     66
6     60
3B    29
3L    18
3T     7
2      6
4      2
Name: Data Quality, dtype: int64
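Two small follow-ups can help when a column mixes codes like these. The sketch below uses a made-up sample of the codes, not the real column. It counts missing entries as well, and it splits each code into its digits and its optional letter, without assuming anything about what either part means.

```python
import pandas as pd

# Hypothetical sample of the codes seen above, plus one missing entry.
quality = pd.Series(["6", "5", "3B", "3L", "3T", "2", None])

# dropna=False also counts missing entries, which value_counts hides by default.
counts = quality.value_counts(dropna=False)
print(counts)

# Split each code into leading digits and an optional letter suffix.
parts = quality.str.extract(r"^(?P<level>\d+)(?P<suffix>[A-Z]?)$")
print(parts)
```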

If we go to the source of this data, the [2016 Global Ecological Footprint](https://www.kaggle.com/footprintnetwork/ecological-footprint) dataset on Kaggle, there is no description of what the values in this column mean. 

1. Does it rank the data quality of the countries? 
2. What does each value represent? 
3. Are lower or higher values better? 

And so on. There are many questions that we cannot answer.

If we explore further, there is a [post](https://www.kaggle.com/footprintnetwork/ecological-footprint/discussion/74703) in the discussion tab of the dataset's Kaggle page. Take a minute to navigate to the post and read it.

There seems to be some inconsistency with the data that we cannot explain yet. As it stands, the 'faithfulness' of the data is in question. 

From here, depending on the situation, we can either ignore this column, research further what the data means, or perform some analysis on it while adding a note explaining the faithfulness problem.