<a href="https://colab.research.google.com/github/boboguan/QM2Gr13/blob/main/Phase1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Impact of Oil Dependency on the Socio-Economic Development of Major Oil Exporters**



In [None]:
import pandas as pd
from sklearn.impute import SimpleImputer
import seaborn as sns
import numpy as np
import plotly
import plotly.express as px
import warnings
from statsmodels.formula.api import ols
from statsmodels.iolib.summary2 import summary_col
import matplotlib.pyplot as plt

warnings.filterwarnings('ignore')
sns.set(font_scale=1.5)
sns.set_style("white")
plt.rcParams['figure.figsize'] = (12, 8)

1. Import the Necessary Libraries and Upload the Dataset Into a Data File
2. Use Appropriate Data Imputation Techniques to Fill in Missing Data


In [None]:
!mkdir -p data/grproject

In [None]:
from google.colab import files
uploaded = files.upload()

In [None]:
df = pd.read_csv('Oil Exporting Country Data - COUNTRY DATA.csv', skiprows = 2)

df.columns = ['Country', 'Year', 'GDP per Capita', 'Oil Rent',
              'Debt to GDP Ratio', 'HDI', 'Democracy Index',
              'Civil Rights Freedom Indexes', 'Gini Coefficient']

In [None]:
# Convert 'Year' to integer and other numerical columns to float
df['Year'] = pd.to_numeric(df['Year'], errors='coerce')  # Convert to numeric, making non-numeric values NaN
df.dropna(subset=['Year'], inplace=True)  # Drop rows where 'Year' is NaN
df['Year'] = df['Year'].astype(int)
df['GDP per Capita'] = pd.to_numeric(df['GDP per Capita'], errors='coerce')
df['Oil Rent'] = pd.to_numeric(df['Oil Rent'], errors='coerce')
df['Debt to GDP Ratio'] = pd.to_numeric(df['Debt to GDP Ratio'], errors='coerce')
df['HDI'] = pd.to_numeric(df['HDI'], errors='coerce')
df['Gini Coefficient'] = pd.to_numeric(df['Gini Coefficient'], errors='coerce')

# Drop completely empty rows if any
df.dropna(how='all', inplace=True)

In [None]:
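#Data Imputation
# Linear interpolation runs over the whole column in row order (it is not grouped by country)
df['Gini Coefficient'] = df['Gini Coefficient'].interpolate(method='linear')
# Replace missing values with the mean or median of the column.
# Assign the result back: fillna(inplace=True) on a selected column is deprecated in recent pandas.
df['GDP per Capita'] = df['GDP per Capita'].fillna(df['GDP per Capita'].mean())
df['Oil Rent'] = df['Oil Rent'].fillna(df['Oil Rent'].mean())
df['Debt to GDP Ratio'] = df['Debt to GDP Ratio'].fillna(df['Debt to GDP Ratio'].median())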

In [None]:
# Replace missing values with the mode (most frequent value)
# Assuming these are categorical or ordinal and have a common frequent value
df['Democracy Index'] = df['Democracy Index'].fillna(df['Democracy Index'].mode()[0])
df['Civil Rights Freedom Indexes'] = df['Civil Rights Freedom Indexes'].fillna(df['Civil Rights Freedom Indexes'].mode()[0])


In [None]:
# Verify the changes
print(df.head())
print(df.isnull().sum())

In [None]:
print(df)

As the output above shows, the Gini coefficient for Saudi Arabia cannot be interpolated because it has only one reference data point (2019). This is a limitation of the analysis.
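
The single-data-point problem can be illustrated with a minimal sketch. It assumes interpolation is done separately for each country, which is not what the imputation cell above does. The countries `A` and `B`, their values, and the helper `interpolate_within_country` are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): country A has two observed points,
# country B has only one, in its last year.
toy = pd.DataFrame({
    'Country': ['A'] * 4 + ['B'] * 4,
    'Year': [2016, 2017, 2018, 2019] * 2,
    'Gini Coefficient': [30.0, np.nan, np.nan, 36.0,
                         np.nan, np.nan, np.nan, 45.0],
})

def interpolate_within_country(frame, column):
    """Linearly interpolate `column` separately for each country.

    limit_area='inside' fills only gaps that lie between two observed
    values, so nothing is extrapolated from a single point.
    """
    ordered = frame.sort_values(['Country', 'Year'])
    return ordered.groupby('Country')[column].transform(
        lambda s: s.interpolate(method='linear', limit_area='inside')
    )

toy['Gini interpolated'] = interpolate_within_country(toy, 'Gini Coefficient')
print(toy)
```

In this sketch the gaps for country `A` are filled along the line between its two observations, while country `B` keeps its missing values because a single observation gives nothing to interpolate between.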

In [None]:
#Data Cleaning
# Replace empty or whitespace-only strings with np.nan
df = df.replace(r'^\s*$', np.nan, regex=True)
# Convert every column except 'Country' to numeric; coercing 'Country' would turn all country names into NaN
numeric_cols = df.columns.drop('Country')
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors='coerce')

In [None]:
print(df)

In [None]:
!pip install linearmodels

In [None]:
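#Panel Regression
import pandas as pd
from linearmodels import PanelOLS
from linearmodels import RandomEffects
import statsmodels.formula.api as smf
from linearmodels.panel import compare

# Formula terms cannot contain spaces, so rename the columns used in the model
panel_df = df.rename(columns={'GDP per Capita': 'gdp_per_capita',
                              'Oil Rent': 'oil_rent',
                              'Debt to GDP Ratio': 'debt_to_gdp'})
# linearmodels expects an (entity, time) MultiIndex
panel_df = panel_df.set_index(['Country', 'Year'])

# GDP per capita is the dependent variable; oil rent, debt-to-GDP ratio and HDI are the
# independent variables, with country fixed effects (EntityEffects)
fe_model = PanelOLS.from_formula('gdp_per_capita ~ oil_rent + debt_to_gdp + HDI + EntityEffects', data=panel_df)
fe_results = fe_model.fit()
print(fe_results)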
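
To make the role of `EntityEffects` more concrete, the following sketch reproduces the idea behind a fixed-effects (within) estimator by hand, using only pandas and NumPy. It is an illustration under assumptions, not the linearmodels implementation: the toy panel, the country effects and the true slope of 2.0 are all invented, and the names `toy_panel` and `within_slope` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy panel (hypothetical values): two countries, three years each.
toy_panel = pd.DataFrame({
    'Country': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Year': [2017, 2018, 2019] * 2,
    'x': [1.0, 2.0, 4.0, 3.0, 5.0, 6.0],
})
# Each country gets its own constant offset; the true slope on x is 2.0.
country_effect = toy_panel['Country'].map({'A': 10.0, 'B': -5.0})
toy_panel['y'] = country_effect + 2.0 * toy_panel['x']

# Within transformation: subtract each country's mean from y and x.
# This removes any effect that is constant over time within a country.
group_means = toy_panel.groupby('Country')[['y', 'x']].transform('mean')
demeaned = toy_panel[['y', 'x']] - group_means

# Ordinary least squares on the demeaned data (no intercept needed).
coef, *_ = np.linalg.lstsq(demeaned[['x']].to_numpy(),
                           demeaned['y'].to_numpy(), rcond=None)
within_slope = float(coef[0])
print(within_slope)
```

Because the country offsets are removed by the demeaning step, the estimate recovers the slope of 2.0 used to build the toy data, up to floating-point error, even though the two countries sit at very different levels.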
