# Big Data Processes - exercise no.2:
## <font color= green> Importing, exploring and visualising data </font> 

## Where to get datasets? 

https://www.gesis.org/en/home GESIS has many different types of surveys from Europe, including the Eurobarometer (which covers many different topics) and the European Values Study, which is conducted roughly every nine years. Create a login, go to Services, and check out the different types of surveys they have done under "Finding and accessing data".

https://www.kaggle.com/datasets Through Kaggle you will have access to over 50,000 public datasets.

Other possible sources for datasets include data.world and Google's Dataset Search:

https://data.world/search?q=type%3Adataset&type=resources

https://datasetsearch.research.google.com/

Collecting data through an API is another possibility. More on this later in the exercise session!

### 1. Import and load


Import pandas

In [1]:
!pip install pandas
import pandas as pd



If you are working with a dataset from GESIS, it will come as a SAV file (the SPSS format). pandas needs the pyreadstat package to open SAV files, so install and import pyreadstat as shown below, then create your dataframe.

In [2]:
!pip install pyreadstat

import pyreadstat



Replace the path below with the path to your own file.

In [None]:
dataset = pd.read_spss(r"C:\Users\Simone\OneDrive\4th semester DIM\BDP TA\Exercise 2\value_survey.sav")

Working with a dataset from GESIS can look a bit overwhelming, but through GESIS you will have access to documents explaining what the different columns mean, and how the survey questions were worded. 

In [None]:
dataset

We will continue with a less overwhelming dataset and go through some more basic functions

In [None]:
df = pd.read_csv('Auto-miss.csv', delimiter=';')

In [None]:
df

### 2. Open the dataset and examine it

We use the .head() and .tail() functions to examine the dataset (like we saw last time).

The .head() function shows the first five rows of the dataset. You can change the number of rows shown by inserting a number in the parentheses.

In [None]:
df.head()

The .tail() function shows the last five rows of the dataset.

In [None]:
df.tail()
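As a small illustration of passing a number to .head() and .tail(), here is a sketch on a made-up toy dataframe (the name `toy` and its values are assumptions, not part of the car dataset):

```python
import pandas as pd

# Hypothetical toy data standing in for the car dataset
toy = pd.DataFrame({"obsNo": range(1, 21), "mpg": [15.0 + i for i in range(20)]})

first_ten = toy.head(10)   # first 10 rows instead of the default 5
last_three = toy.tail(3)   # last 3 rows instead of the default 5

print(first_ten)
print(last_three)
```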

We can get information on the data type of each column through the .dtypes attribute (note: no parentheses).

In [None]:
df.dtypes

int64 means that the data type of the column is an integer (a number without decimals)

object means that the data type of that column is either string (a line of text) or the column has a mix of data types in it.

float64 means that the data type of that column is a float (a number WITH decimals)

As we can see, the data type of the horsepower column is float, whereas for acceleration it is object (string or mixed). This is because the numbers in the acceleration column use commas as the decimal separator, so pandas reads them as text.

We can get the column names through .columns
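One possible way to turn such a comma-decimal column into real numbers is to replace the comma with a dot and then convert. This is only a sketch on made-up values (the `toy` dataframe and the new name `acceleration_num` are assumptions); we return to cleaning data later in the course.

```python
import pandas as pd

# Hypothetical values in the same style as the acceleration column: ',' marks the decimals
toy = pd.DataFrame({"acceleration": ["12,5", "11,0", None]})

# Swap ',' for '.' and let pandas parse the result as numbers; missing values stay missing
acceleration_num = pd.to_numeric(
    toy["acceleration"].str.replace(",", ".", regex=False)
)

print(acceleration_num)
print(acceleration_num.dtype)
```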

In [None]:
df.columns

You can get this information and more through the **.info()** function

In [None]:
df.info()

Important information you can read from the **.info()** output above:
- the number of rows and the index range
- for each column: name, non-null count and data type
- how many columns there are of each data type
- which columns are missing data

For example, the weight column has only 385 entries, whereas most of the other columns have 392.

### 3. Exploration with summary statistics

We can use the **.describe()** function to get basic descriptive statistics for each column where it is applicable.
In the **.describe()** output, the 50% row is the median. However, just because a statistic can be computed for a column does not mean that the result is useful: the mean of an ID column such as obsNo, for example, tells us nothing.

In [None]:
df.describe()
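To check the claim that the 50% row is the median, here is a minimal sketch on made-up numbers (the `toy` dataframe is an assumption, not the car dataset):

```python
import pandas as pd

# Hypothetical toy data: one meaningful column and one identifier column
toy = pd.DataFrame({
    "weight": [2000, 2500, 3000, 3500, 9000],
    "obsNo": [1, 2, 3, 4, 5],
})

summary = toy.describe()

# The 50% row and the median are the same number
print(summary.loc["50%", "weight"], toy["weight"].median())

# The mean is pulled upwards by the one very heavy car
print(summary.loc["mean", "weight"])

# describe() also reports a mean for obsNo, but an average ID is not meaningful
print(summary.loc["mean", "obsNo"])
```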

### 4. Missing data

Not all datasets are complete; many contain missing values. It is important to locate any missing values in a dataset in order to handle them later.

We first count the non-missing values in each column

In [None]:
df.count()

The **.isna()** (is NA?) function can be used to locate the missing values. It returns a boolean (True or False) for each data point in the dataset.

Every cell that shows True marks a missing value.


In [None]:
df.isna()
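A table full of True and False is hard to read for a large dataset. One way to summarise it, sketched here on made-up data (the `toy` dataframe and the variable names are assumptions), is to count the True values per column and to pull out the rows that have at least one missing value:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few missing values
toy = pd.DataFrame({
    "name": ["car a", "car b", "car c"],
    "weight": [3504.0, np.nan, 3436.0],
    "horsepower": [130.0, np.nan, np.nan],
})

# True counts as 1 and False as 0, so the sum is the number of missing values per column
missing_per_column = toy.isna().sum()

# Keep only the rows where at least one column is missing
rows_with_missing = toy[toy.isna().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```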

We will learn how to handle missing data in a later week.

### 5. Filtering (specific rows)

Looking at all the cars at once can be too much, so we can focus on one specific car in the dataset.

dataset['column name'] returns only that specific column of the dataset

In [None]:
df['name']

Below we can see how to select rows based on a specific value

In [None]:
df.loc[df['name'] == 'vw pickup']

Or all the cars with ford in the name

In [None]:
# na=False treats rows with a missing name as "no match" instead of raising an error
df.loc[df['name'].str.contains('ford', na=False)]

You can also filter based on the row position and not on any name.

Here we want to find out what is at position 48 of the dataset. Positions start at 0, so this is the 49th row.


In [None]:
df.iloc[[48]]
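The difference between selecting by position (.iloc) and by index label (.loc) is easiest to see when the index does not start at 0. A minimal sketch on made-up data (the `toy` dataframe and its index labels are assumptions):

```python
import pandas as pd

# Hypothetical toy data whose index labels start at 10 instead of 0
toy = pd.DataFrame({"name": ["car a", "car b", "car c", "car d"]},
                   index=[10, 11, 12, 13])

by_position = toy.iloc[[1]]   # the second row, whatever its label is
by_label = toy.loc[[11]]      # the row whose index label is 11

print(by_position)
print(by_label)
```

Both selections return the same row here because the row at position 1 carries the label 11.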

### 6. Combining datasets

Sometimes, the datasets we find do not contain enough information for our project. Consequently, we need to merge datasets in order to construct a dataset which contains all the necessary information.

We import a dataset containing information about the colour of the cars

In [None]:
colour = pd.read_csv('car-colour.csv', delimiter=';')

In [None]:
colour

In the colour dataset there is a column named obsNo. Because the cars dataset has a column called obsNo as well, we can use that column to merge the colour column into the car dataset.

There are different ways of merging two dataframes. Choosing "outer" is similar to a full outer join in SQL:
https://www.w3schools.com/sql/sql_join_full.asp

In [None]:
carcolour = pd.merge(df, colour, on='obsNo', how='outer')

In [None]:
carcolour
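To see what the `how` argument does, here is a sketch on two made-up tables that only partly overlap (the dataframes `cars` and `colours` and their values are assumptions). The `indicator=True` option adds a `_merge` column that tells you where each row came from.

```python
import pandas as pd

# Hypothetical toy tables: obsNo 1 only exists in cars, obsNo 4 only in colours
cars = pd.DataFrame({"obsNo": [1, 2, 3], "name": ["car a", "car b", "car c"]})
colours = pd.DataFrame({"obsNo": [2, 3, 4], "colour": ["red", "blue", "green"]})

# outer: keep every obsNo from both tables, fill the gaps with NaN
outer = pd.merge(cars, colours, on="obsNo", how="outer", indicator=True)

# inner: keep only the obsNo values found in both tables
inner = pd.merge(cars, colours, on="obsNo", how="inner")

# left: keep every row of cars, add the colour where one exists
left = pd.merge(cars, colours, on="obsNo", how="left")

print(outer)
print(inner)
print(left)
```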

Another way to combine two datasets is **pd.concat()**

In [None]:
cardata = [df, colour]
carcolour2 = pd.concat(cardata, sort=False, join='outer')

In [None]:
carcolour2

##### Question 1: 
How has concat combined the two datasets in comparison to merge?

You can also combine two datasets by using **.join()**

In [None]:
df.join(colour, on='obsNo', rsuffix='_r')

##### Question 2:
Why do you think that the last 6 rows show NaN for obsNo_r and colour?

In [None]:
#let us look at the rows between row 120 and 130 (130 excluded)
df.iloc[120:130]

Look at obsNo 127. What is wrong?

#### Exercise 1:

Locate the dataset "Who eats the food we grow?" on Kaggle and load it into a variable called food. Show the first 10 rows of the dataset.

In [None]:
#write your code here:


#### Exercise 2:
What is the data unit in the food dataset? The data unit is the instance that is described by all the variables in the dataset.

#### Exercise 3:

Locate the dataset on One-Sided Violence "Government of Angola - Civilians" in the UCDP (Uppsala Conflict Data Programme) and load it into a variable called df (short for dataframe). Find out the data type and the name of each column.

In [None]:
#write your code here:


#### Exercise 4:

How does the UCDP collect their data on conflicts and how may this impact research based on this data? 

#### Exercise 5: 

Which additional datasets, reports, etc. could prove beneficial for a research project based on the dataset on One-Sided Violence "Government of Angola - Civilians"? Describe how you would use the additional material in a project.

#### Exercise 6:

Load the dataset "car-maxspeed.csv" into a variable called max_speed (Python variable names cannot contain a hyphen)

In [None]:
#write your code here


#### Exercise 7:

Combine the "carcolour" dataset with the "max_speed" dataset (the car-maxspeed.csv data from Exercise 6) in a new dataset called "carcolourspeed"

In [None]:
#write your code here


#### Exercise 8:

Locate all cars with "chevrolet" in the name in the "carcolourspeed" dataset and print them

In [None]:
#write your code here
