# Project 1: Exploratory Data Analysis (EDA)

<br>

<div style="text-align: justify">You are required to select your own dataset from:</div>

[https://www.kaggle.com/datasets](https://www.kaggle.com/datasets)

<br>

<div style="text-align: justify">Once you have selected your dataset, inform your instructor, because each student must use a different dataset.</div>

<br>

<div style="text-align: justify">Your work must show the following steps (they need not be followed in order): <b>[50 marks]</b></div>

1. Understanding the data: Import the data. Start by getting a basic understanding of the data you're working with, such as the size of the dataset, data types, and data structure.

2. Cleaning the data: Identify and handle any missing or corrupted data. This may involve imputing missing values, removing duplicates, or correcting errors.

3. Visualizing the data: Use visualizations such as histograms, box plots, scatter plots, and heat maps to gain insights into the distribution, variability, and relationships between variables. State which type of visualization technique each plot uses: univariate, bivariate, or multivariate.

4. Analyzing relationships: Explore correlations and dependencies between variables using statistical measures such as correlation coefficients.

5. Identifying anomalies: Look for any unusual or unexpected patterns or outliers that may indicate errors or interesting phenomena (if any).

6. Summarizing the data: Based on the insights gained from the data exploration, formulate hypotheses about the relationships between variables. Finally, communicate the insights gained from the EDA using clear and effective narrative summaries.

<br>

## Submission Instructions:

<br>

<div style="text-align: justify">Deadline to submit in Brighten: <b>14th April 2023, 11:59 pm</b> </div>

<br>

<div style="text-align: justify">Print the entire notebook to PDF and submit ONLY the PDF copy in Brighten.</div>

In [1]:
#Student name: Andy Steve Lojuntin
#SID: EP0105960

# Link to dataset on Kaggle: https://www.kaggle.com/datasets/xaviernogueira/aqueduct-30-water-risk-atlas-basin-scores

In [8]:
# Importing required packages

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pyarrow.parquet as pq

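A Parquet file can be read directly into a DataFrame with `pd.read_parquet(filename, engine=enginename)`. This notebook instead reads it with `pyarrow.parquet.read_table` and then converts the result with `to_pandas()`. Read more at the following link: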

https://stackoverflow.com/questions/33813815/how-to-read-a-parquet-file-into-pandas-dataframe

In [16]:
# Read "aqueduct30_basin_scores.parquet" from the current directory into a pyarrow Table
basin_scores = pq.read_table('aqueduct30_basin_scores.parquet')

In [19]:
# Convert the pyarrow Table to a pandas DataFrame and save a CSV copy
df_basin = basin_scores.to_pandas()
df_basin.to_csv('aqueduct30_basin_scores.csv')

In [20]:
df_basin.head(10)

aq30_id,string_id,pfaf_id,gid_1,aqid,bws_score,bwd_score,iav_score,sev_score,gtd_score,rfr_score,...,w_awr_ong_rrr_score,w_awr_ong_tot_score,w_awr_smc_qan_score,w_awr_smc_qal_score,w_awr_smc_rrr_score,w_awr_smc_tot_score,w_awr_tex_qan_score,w_awr_tex_qal_score,w_awr_tex_rrr_score,w_awr_tex_tot_score
0,111011-EGY.11_1-3365,111011,EGY.11_1,3365,5.0,4.948243,4.141657,2.887187,,4.180674,...,1.856046,2.030983,4.581623,1.778723,2.165272,4.037438,4.431637,1.786984,2.165272,3.614603
1,111011-EGY.15_1-3365,111011,EGY.15_1,3365,5.0,4.948243,4.141657,2.887187,,4.180674,...,1.856046,2.030983,4.581623,1.778723,2.165272,4.037438,4.431637,1.786984,2.165272,3.614603
2,111011-EGY.15_1-None,111011,EGY.15_1,-9999,5.0,4.948243,4.141657,2.887187,,4.180674,...,1.856046,2.030983,4.581623,1.778723,2.165272,4.037438,4.431637,1.786984,2.165272,3.614603
3,111011-None-3365,111011,-9999,3365,5.0,4.948243,4.141657,2.887187,,4.180674,...,1.133763,1.52418,4.581623,1.73742,1.133763,4.138271,4.431637,1.73742,1.133763,3.649648
4,111011-None-None,111011,-9999,-9999,5.0,4.948243,4.141657,2.887187,,4.180674,...,1.133763,1.52418,4.581623,1.73742,1.133763,4.138271,4.431637,1.73742,1.133763,3.649648
5,111012-EGY.11_1-3365,111012,EGY.11_1,3365,5.0,5.0,4.645469,3.082393,,0.0,...,2.217443,2.21349,4.422714,1.778723,2.410512,3.960088,4.277754,1.786984,2.410512,3.560073
6,111012-EGY.15_1-3365,111012,EGY.15_1,3365,5.0,5.0,4.645469,3.082393,,0.0,...,2.217443,2.21349,4.422714,1.778723,2.410512,3.960088,4.277754,1.786984,2.410512,3.560073
7,111012-EGY.15_1-None,111012,EGY.15_1,-9999,5.0,5.0,4.645469,3.082393,,0.0,...,2.217443,2.21349,4.422714,1.778723,2.410512,3.960088,4.277754,1.786984,2.410512,3.560073
8,111012-EGY.8_1-3365,111012,EGY.8_1,3365,5.0,5.0,4.645469,3.082393,,0.0,...,2.217443,2.21349,4.422714,1.778723,2.410512,3.960088,4.277754,1.786984,2.410512,3.560073
9,111012-None-None,111012,-9999,-9999,5.0,5.0,4.645469,3.082393,,0.0,...,1.778313,1.714568,4.422714,1.73742,1.778313,4.067284,4.277754,1.73742,1.778313,3.562805
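
The preview above shows `-9999` in `gid_1` and `aqid` wherever the matching `string_id` contains `None`, and empty cells in `gtd_score`. Below is a minimal sketch of one way to handle this. It assumes `-9999` is a missing-value sentinel, which is an interpretation of the preview and not something confirmed by the dataset documentation. It uses a small hand-built frame in place of `df_basin`.

```python
import numpy as np
import pandas as pd

# Small stand-in for df_basin, copied from a few rows of the preview.
sample = pd.DataFrame(
    {
        "string_id": [
            "111011-EGY.11_1-3365",
            "111011-EGY.15_1-None",
            "111011-None-None",
        ],
        "gid_1": ["EGY.11_1", "EGY.15_1", "-9999"],
        "aqid": [3365, -9999, -9999],
        "gtd_score": [np.nan, np.nan, np.nan],
    }
)

# Assumption: -9999 marks "no value". It may be stored as a number or as
# text, so both forms are replaced with NaN.
cleaned = sample.replace([-9999, "-9999"], np.nan)

# Number of missing values per column after the replacement.
missing_per_column = cleaned.isna().sum()
print(missing_per_column)
```

Converting the sentinel to `NaN` before computing statistics keeps `-9999` from distorting means, correlations, and plots.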


In [21]:
df_basin.shape

(68506, 57)
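
Beyond `shape`, a quick look at data types and the share of missing values per column can help when getting to know the data. A minimal sketch follows, using a small hand-built stand-in for `df_basin`. The values are illustrative only, and `summarize_frame` is a hypothetical helper, not part of pandas.

```python
import numpy as np
import pandas as pd


def summarize_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its dtype and share of missing values."""
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "missing_share": df.isna().mean(),
        }
    )


# Illustrative stand-in for df_basin (values are made up for the demo).
sample = pd.DataFrame(
    {
        "pfaf_id": [111011, 111011, 111012, 111012],
        "bws_score": [5.0, 5.0, 5.0, 5.0],
        "gtd_score": [np.nan, np.nan, np.nan, 1.5],
    }
)

summary = summarize_frame(sample)
print(sample.shape)
print(summary)
```

On the full dataset the same helper could be called as `summarize_frame(df_basin)`.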

In [22]:
print(df_basin.columns)

Index(['string_id', 'pfaf_id', 'gid_1', 'aqid', 'bws_score', 'bwd_score',
       'iav_score', 'sev_score', 'gtd_score', 'rfr_score', 'cfr_score',
       'drr_score', 'ucw_score', 'cep_score', 'udw_score', 'usa_score',
       'rri_score', 'w_awr_def_qan_score', 'w_awr_def_qal_score',
       'w_awr_def_rrr_score', 'w_awr_def_tot_score', 'w_awr_agr_qan_score',
       'w_awr_agr_qal_score', 'w_awr_agr_rrr_score', 'w_awr_agr_tot_score',
       'w_awr_che_qan_score', 'w_awr_che_qal_score', 'w_awr_che_rrr_score',
       'w_awr_che_tot_score', 'w_awr_con_qan_score', 'w_awr_con_qal_score',
       'w_awr_con_rrr_score', 'w_awr_con_tot_score', 'w_awr_elp_qan_score',
       'w_awr_elp_qal_score', 'w_awr_elp_rrr_score', 'w_awr_elp_tot_score',
       'w_awr_fnb_qan_score', 'w_awr_fnb_qal_score', 'w_awr_fnb_rrr_score',
       'w_awr_fnb_tot_score', 'w_awr_min_qan_score', 'w_awr_min_qal_score',
       'w_awr_min_rrr_score', 'w_awr_min_tot_score', 'w_awr_ong_qan_score',
       'w_awr_ong_qal_score', 'w

#### Understanding the column designations
The columns in this dataset fall into four groups:
1. Identifiers
2. Physical Risk Quantity
3. Physical Risk Quality
4. Regulatory and Reputational Risk
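
One way to make this grouping concrete is to sort column names by the token that appears in the weighted `w_awr_*` columns. The sketch below assumes that `qan`, `qal`, and `rrr` stand for Physical Risk Quantity, Physical Risk Quality, and Regulatory and Reputational Risk respectively, and that `tot` is an overall score. These readings are inferred from the column names and are not confirmed by the dataset documentation. `classify_column` is a hypothetical helper.

```python
import re

# Identifier columns, as listed at the start of df_basin.columns.
IDENTIFIERS = {"string_id", "pfaf_id", "gid_1", "aqid"}

# Assumed meaning of the token inside w_awr_<sector>_<token>_score.
TOKEN_TO_GROUP = {
    "qan": "Physical Risk Quantity",
    "qal": "Physical Risk Quality",
    "rrr": "Regulatory and Reputational Risk",
    "tot": "Overall",
}


def classify_column(name: str) -> str:
    """Assign a column name to one of the groups above (assumed mapping)."""
    if name in IDENTIFIERS:
        return "Identifiers"
    match = re.fullmatch(r"w_awr_[a-z]+_(qan|qal|rrr|tot)_score", name)
    if match:
        return TOKEN_TO_GROUP[match.group(1)]
    # Individual indicator columns such as bws_score are left unclassified here.
    return "Unclassified"


columns = [
    "string_id",
    "aqid",
    "bws_score",
    "w_awr_def_qan_score",
    "w_awr_def_qal_score",
    "w_awr_agr_rrr_score",
    "w_awr_tex_tot_score",
]

groups = {}
for column in columns:
    groups.setdefault(classify_column(column), []).append(column)

for group, members in groups.items():
    print(f"{group}: {members}")
```

The individual indicator columns (for example `bws_score`) are deliberately left unclassified, since their group cannot be read off the name alone.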
