In [58]:
import pandas as pd
import numpy as np
import wbgapi as wb
import yfinance as yf
import wbdata
from sklearn.linear_model import LinearRegression
import scipy.stats as st
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split



# Overall Rules

- Refrain from saving datasets locally. You may experiment with your answers on a locally saved version of the datasets, but do not upload your local files with your homework as the datasets are very large. In your submitted answers datasets should be read from the original source URL.
- Document all of your steps by writing appropriate markdown cells in your notebook. Refrain from using code comments to explain what has been done.
- Avoid duplicating code. Do not copy and paste code from one cell to another. If copying and pasting is necessary, write a suitable function for the task at hand and call that function.
- Document your use of LLMs (ChatGPT, Claude, Code Pilot etc). Either take screenshots of your steps and include them with this notebook, or give me a full log (both questions and answers) in a markdown file named HW2-LLM-LOG.md.

Failure to adhere to these guidelines will result in a 25-point deduction for each infraction.

# HW2

## Q1

There are 22 countries surrounding the Mediterranean Sea: Spain, France, Monaco, Italy, Slovenia, Croatia, Bosnia and Herzegovina, Montenegro, Albania, Greece, Turkey, Syria, Lebanon, Israel, Palestine, Egypt, Libya, Tunisia, Algeria, and Morocco, with two island countries Malta and Cyprus.

1. Get the following data for every country in the list above from the World Bank Data server (using the `wbgapi` library)

- Adult female literacy (SE.ADT.LITR.FE.ZS)
- Adult female workforce participation rate (SL.TLF.ACTI.ZS)
- Child mortality rate (SP.DYN.IMRT.IN)
- Gini index (SI.POV.GINI)
- Life expectancy (SP.DYN.LE00.IN)
- GDP (NY.GDP.PCAP.CD)

2. Write a function that does linear regression for Log(mortality) against the other variables (except mortality).
3. Analyze the regression results for Spain, France, Turkey, Syria, and Israel.
4. Analyze the results for 2 other countries of your choice.

## Q1-Solutions

### 1) Ingesting the Data

I had already done a mini project on retrieving data for Mediterranean countries, so I reused parts of that [project](https://github.com/etumkaya/381E_data_science/blob/main/HW1.ipynb) for ingesting the data. I wanted to write a function that fetches the right indicators for each country.

In [46]:
def fetch_world_bank_data(country_codes):
    indicators = {
        "SE.ADT.LITR.FE.ZS": "Adult female literacy (% aged 15 and older)",
        "SL.TLF.ACTI.ZS": "Adult female workforce participation rate (% ages 15 and older)",
        "SP.DYN.IMRT.IN": "Child mortality rate (per 1,000 live births)",
        "SI.POV.GINI": "Gini index (World Bank estimate)",
        "SP.DYN.LE00.IN": "Life expectancy at birth (years)",
        "NY.GDP.PCAP.CD": "GDP per capita (current US$)"
    }

    country_data = {}

    for country_code in country_codes:
        try:
            data = wbdata.get_dataframe(indicators, country=country_code)
            country_data[country_code] = data
        except Exception as error:
            print(f"Failed to fetch data for {country_code}: {error}")

    return country_data




I then looked up the country codes for the surrounding countries, called my function with those codes, and stored the resulting dataframes in a dictionary called `world_bank_data`.

In [47]:
country_codes = ["FRA","ITA","SVN","HRV","BIH","MNE","ALB","GRC","TUR","SYR","LBN","ISR","PSE","EGY","LBY","TUN","DZA","MAR","MLT","CYP","MCO","ESP"]

world_bank_data = fetch_world_bank_data(country_codes)

turkey_data = world_bank_data["TUR"]
turkey_data

| date | Adult female literacy (% aged 15 and older) | Adult female workforce participation rate (% ages 15 and older) | Child mortality rate (per 1,000 live births) | Gini index (World Bank estimate) | Life expectancy at birth (years) | GDP per capita (current US$) |
|---|---|---|---|---|---|---|
| 2022 | NaN | 58.314 | NaN | NaN | NaN | 10674.504173 |
| 2021 | NaN | 56.321 | 7.7 | NaN | 76.032 | 9743.213131 |
| 2020 | NaN | 54.048 | 8.1 | NaN | 75.850 | 8638.739133 |
| 2019 | 94.424042 | 57.710 | 8.6 | 41.9 | 77.832 | 9215.440875 |
| 2018 | NaN | 57.714 | 9.2 | 41.9 | 77.563 | 9568.836190 |
| ... | ... | ... | ... | ... | ... | ... |
| 1964 | NaN | NaN | 150.4 | NaN | 53.714 | 365.133869 |
| 1963 | NaN | NaN | 155.5 | NaN | 53.173 | 347.177091 |
| 1962 | NaN | NaN | 160.8 | NaN | 52.382 | 307.306286 |
| 1961 | NaN | NaN | 166.2 | NaN | 51.550 | 282.742464 |


It looks fine. However, there are many null values, which I will handle before building models.
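Since the null counts have to be checked for every country, one option is a single helper that summarizes them all at once. The sketch below is only an illustration: the function name `non_null_summary` and the toy dictionary are assumptions, standing in for `world_bank_data`.

```python
import numpy as np
import pandas as pd

def non_null_summary(country_frames):
    """Return one row per country with the non-null count of every column."""
    counts = {code: frame.notna().sum() for code, frame in country_frames.items()}
    # Each Series becomes a column, so transpose to get one row per country.
    return pd.DataFrame(counts).T

# Toy stand-in for world_bank_data (hypothetical codes and values).
toy = {
    "AAA": pd.DataFrame({"literacy": [np.nan, 90.0, np.nan], "gdp": [1.0, 2.0, 3.0]}),
    "BBB": pd.DataFrame({"literacy": [np.nan, np.nan, np.nan], "gdp": [1.0, np.nan, 3.0]}),
}

summary = non_null_summary(toy)
print(summary)
```

Calling `non_null_summary(world_bank_data)` would presumably give the same overview for the real data, as long as every country was fetched successfully.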

### 2) Linear Regression Function

At this point, I tried to figure out the best validation method for this task. I didn't want to use the hold-out method because I wanted to form confidence intervals to analyze my results better. K-fold cross-validation was one option, but I was not planning on using time as a variable and didn't want time dependencies in my model evaluation. That is why I preferred Monte Carlo cross-validation.

In [79]:
def linear_regression_monte_carlo(data, N=10):
    X = data.drop(columns=['Child mortality rate (per 1,000 live births)'])  
    y = np.log(data['Child mortality rate (per 1,000 live births)']) 
    
    r2_scores = []

    for i in range(N):
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

        model = LinearRegression()
        model.fit(X_train, y_train)
        y_predict = model.predict(X_test)
        
        
        r2 = r2_score(y_test, y_predict)
        r2_scores.append(r2)

    mean_r2 = np.mean(r2_scores)
    interval = st.t.interval(0.95, df=N-1, loc=mean_r2, scale=st.sem(r2_scores))
    
    return mean_r2, interval
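
The same Monte Carlo idea can also be expressed with scikit-learn's `ShuffleSplit`, which draws repeated random train/test splits. The sketch below is only an illustration on synthetic data: the arrays `X` and `y`, the seeds, and the 200 splits are assumptions, not values from this notebook, and it is not the function used for the results that follow.

```python
import numpy as np
import scipy.stats as st
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Synthetic data (assumption): a linear signal with a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = 1.5 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=60)

# Repeated random 75/25 splits, i.e. Monte Carlo cross-validation.
splitter = ShuffleSplit(n_splits=200, test_size=0.25, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=splitter, scoring="r2")

# 95% t-interval around the mean score, as in the function above.
mean_r2 = scores.mean()
low, high = st.t.interval(0.95, df=len(scores) - 1, loc=mean_r2, scale=st.sem(scores))
print(mean_r2, (low, high))
```

Because the random splits overlap, the individual scores are not independent, so the interval is probably narrower than the true uncertainty of the model.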
    

### 3) Analyzing the Regression Results

### Model Evaluation for Spain

First, I will take a look at my dataframe for Spain.

In [51]:
world_bank_data["ESP"].info()

<class 'wbdata.client.DataFrame'>
Index: 63 entries, 2022 to 1960
Data columns (total 6 columns):
 #   Column                                                           Non-Null Count  Dtype  
---  ------                                                           --------------  -----  
 0   Adult female literacy (% aged 15 and older)                      17 non-null     float64
 1   Adult female workforce participation rate (% ages 15 and older)  32 non-null     float64
 2   Child mortality rate (per 1,000 live births)                     62 non-null     float64
 3   Gini index (World Bank estimate)                                 29 non-null     float64
 4   Life expectancy at birth (years)                                 62 non-null     float64
 5   GDP per capita (current US$)                                     63 non-null     float64
dtypes: float64(6)
memory usage: 3.4+ KB


Some features, namely female literacy and the Gini index, have many missing values. Since there are so many, I don't want to impute them. I will take two approaches here: first, I will delete the rows with missing data and look at the results; second, I will remove the columns that are mostly null and try the model that way.

In [56]:
Spain_no_na=world_bank_data["ESP"].dropna()
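
The two approaches to missing data can be compared side by side. This is a minimal sketch on a toy dataframe: the helper name `drop_sparse_columns` and the 50% cutoff are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(frame, min_fraction=0.5):
    """Drop columns with too few non-null values, then drop incomplete rows."""
    min_count = int(np.ceil(min_fraction * len(frame)))
    # thresh keeps only columns with at least min_count non-null entries.
    return frame.dropna(axis=1, thresh=min_count).dropna()

# Toy stand-in for one country's dataframe (hypothetical values).
toy = pd.DataFrame({
    "literacy": [np.nan, np.nan, np.nan, 97.0],
    "life_exp": [80.0, 81.0, np.nan, 83.0],
    "mortality": [4.0, 3.5, 3.2, 3.0],
})

rows_only = toy.dropna()                  # approach 1: drop incomplete rows
columns_first = drop_sparse_columns(toy)  # approach 2: drop sparse columns first
print(rows_only.shape, columns_first.shape)
```

In this toy case the first approach leaves a single row, while the second keeps three rows at the cost of losing the sparse column.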

Before I move on, I want to check the correlation table to see what is going on in a clearer way.

In [205]:
Spain_no_na.corr()

| | Adult female literacy (% aged 15 and older) | Adult female workforce participation rate (% ages 15 and older) | Child mortality rate (per 1,000 live births) | Gini index (World Bank estimate) | Life expectancy at birth (years) | GDP per capita (current US$) |
|---|---|---|---|---|---|---|
| Adult female literacy (% aged 15 and older) | 1.0 | 0.580901 | -0.85122 | 0.511491 | 0.77634 | -0.092724 |
| Adult female workforce participation rate (% ages 15 and older) | 0.580901 | 1.0 | -0.852879 | 0.907846 | 0.891739 | 0.271707 |
| Child mortality rate (per 1,000 live births) | -0.85122 | -0.852879 | 1.0 | -0.832026 | -0.952332 | 0.030913 |
| Gini index (World Bank estimate) | 0.511491 | 0.907846 | -0.832026 | 1.0 | 0.868733 | 0.061226 |
| Life expectancy at birth (years) | 0.77634 | 0.891739 | -0.952332 | 0.868733 | 1.0 | -0.005456 |
| GDP per capita (current US$) | -0.092724 | 0.271707 | 0.030913 | 0.061226 | -0.005456 | 1.0 |


This tells a lot about the data, even though I have only a small number of data points. Female literacy is correlated with both life expectancy (0.78) and female workforce participation (0.58). I will try to choose carefully what to drop and what to keep. I don't want to lose life expectancy because it has a very high correlation (-0.95) with child mortality, which is my target. On the other hand, I find it useful to drop adult female literacy, female workforce participation, and the Gini index because of their high correlation with life expectancy. I will try my model with these columns removed.

In [208]:
Spain_new=Spain_no_na.drop(columns=["Adult female literacy (% aged 15 and older)", "Adult female workforce participation rate (% ages 15 and older)", "Gini index (World Bank estimate)"] )
Spain_new

| date | Child mortality rate (per 1,000 live births) | Life expectancy at birth (years) | GDP per capita (current US$) |
|---|---|---|---|
| 2020 | 2.6 | 82.331707 | 26984.296277 |
| 2018 | 2.7 | 83.431707 | 30379.721113 |
| 2016 | 2.7 | 83.329268 | 26537.159489 |
| 2015 | 2.8 | 82.831707 | 25754.361029 |
| 2014 | 2.8 | 83.229268 | 29513.65118 |
| 2013 | 2.9 | 83.078049 | 29077.182056 |
| 2012 | 2.9 | 82.426829 | 28322.946592 |
| 2011 | 3.0 | 82.47561 | 31677.900308 |
| 2010 | 3.2 | 81.626829 | 30532.480508 |
| 2009 | 3.3 | 81.47561 | 32169.502855 |
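
The manual column choice for Spain can be cross-checked with a simple greedy rule: keep the predictors most correlated with the target, and skip any predictor that is too correlated with one already kept. This is only a sketch; the function name `select_low_collinearity`, the short column labels, and the 0.75 threshold are assumptions, while the numbers are copied from the Spain correlation table.

```python
import pandas as pd

def select_low_collinearity(corr, target, threshold=0.75):
    """Greedily keep predictors, strongest |corr| with the target first,
    skipping any whose |corr| with an already kept predictor exceeds threshold."""
    ranked = corr[target].drop(target).abs().sort_values(ascending=False).index
    kept = []
    for name in ranked:
        if all(abs(corr.loc[name, other]) <= threshold for other in kept):
            kept.append(name)
    return kept

# Short labels (assumption) for the six World Bank columns.
names = ["literacy", "workforce", "mortality", "gini", "life_exp", "gdp"]
values = [
    [1.0, 0.580901, -0.85122, 0.511491, 0.77634, -0.092724],
    [0.580901, 1.0, -0.852879, 0.907846, 0.891739, 0.271707],
    [-0.85122, -0.852879, 1.0, -0.832026, -0.952332, 0.030913],
    [0.511491, 0.907846, -0.832026, 1.0, 0.868733, 0.061226],
    [0.77634, 0.891739, -0.952332, 0.868733, 1.0, -0.005456],
    [-0.092724, 0.271707, 0.030913, 0.061226, -0.005456, 1.0],
]
spain_corr = pd.DataFrame(values, index=names, columns=names)

kept = select_low_collinearity(spain_corr, target="mortality")
print(kept)  # ['life_exp', 'gdp']
```

With this threshold the rule keeps life expectancy and GDP per capita, the same two predictors that remain in `Spain_new`. A different threshold could give a different selection.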


In [212]:
linear_regression_monte_carlo(Spain_new, N=1000)

(0.6279063638270245, (0.5598328995837077, 0.6959798280703412))

For comparison, I want to try the model on the dataset where I didn't drop the most correlated columns, to see how much that affects R^2.

In [213]:
linear_regression_monte_carlo(Spain_no_na, N=1000)

(0.3323216518858384, (0.24554342032508336, 0.41909988344659344))

Keeping all columns roughly halved the mean R^2 (0.33 versus 0.63), which shows how important it was to drop them.

### Model Evaluation for Turkey

In [218]:
Turkey_no_na=world_bank_data["TUR"].dropna()
Turkey_no_na.corr()

| | Adult female literacy (% aged 15 and older) | Adult female workforce participation rate (% ages 15 and older) | Child mortality rate (per 1,000 live births) | Gini index (World Bank estimate) | Life expectancy at birth (years) | GDP per capita (current US$) |
|---|---|---|---|---|---|---|
| Adult female literacy (% aged 15 and older) | 1.0 | 0.947211 | -0.9762 | 0.327856 | 0.947314 | 0.794527 |
| Adult female workforce participation rate (% ages 15 and older) | 0.947211 | 1.0 | -0.950396 | 0.51671 | 0.966105 | 0.595473 |
| Child mortality rate (per 1,000 live births) | -0.9762 | -0.950396 | 1.0 | -0.287875 | -0.980433 | -0.753991 |
| Gini index (World Bank estimate) | 0.327856 | 0.51671 | -0.287875 | 1.0 | 0.377441 | -0.084016 |
| Life expectancy at birth (years) | 0.947314 | 0.966105 | -0.980433 | 0.377441 | 1.0 | 0.668591 |
| GDP per capita (current US$) | 0.794527 | 0.595473 | -0.753991 | -0.084016 | 0.668591 | 1.0 |


I will take the same approach. This time I drop "adult female literacy" and "adult female workforce participation", which are both correlated above 0.94 with life expectancy.

In [222]:
Turkey_new=Turkey_no_na.drop(columns=["Adult female literacy (% aged 15 and older)", "Adult female workforce participation rate (% ages 15 and older)"] )
linear_regression_monte_carlo(Turkey_new, N=1000)

(0.913174917994578, (0.8973921954118251, 0.9289576405773308))

The model performed far better than I expected. 
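
Because the same drop, clean, and evaluate steps repeat for every country, they could be wrapped in one helper. The sketch below is one possible shape for it, not the code that produced the numbers in this notebook: the name `evaluate_country`, the seeding scheme, and the synthetic dataframe are assumptions, and the Monte Carlo loop is re-implemented compactly so the example is self-contained.

```python
import numpy as np
import pandas as pd
import scipy.stats as st
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

TARGET = "Child mortality rate (per 1,000 live births)"

def evaluate_country(frame, drop_columns=(), n_runs=100, seed=0):
    """Drop columns, remove incomplete rows, and score log(target) over random splits."""
    clean = frame.drop(columns=list(drop_columns)).dropna()
    X = clean.drop(columns=[TARGET])
    y = np.log(clean[TARGET])
    scores = []
    for run in range(n_runs):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.25, random_state=seed + run
        )
        model = LinearRegression().fit(X_tr, y_tr)
        scores.append(r2_score(y_te, model.predict(X_te)))
    mean_r2 = float(np.mean(scores))
    interval = st.t.interval(0.95, df=n_runs - 1, loc=mean_r2, scale=st.sem(scores))
    return mean_r2, interval

# Synthetic stand-in for one country (assumption): mortality falls as life expectancy rises.
rng = np.random.default_rng(1)
life = np.linspace(60, 80, 40)
toy = pd.DataFrame({
    TARGET: np.exp(8.0 - 0.08 * life + rng.normal(scale=0.02, size=40)),
    "Life expectancy at birth (years)": life,
    "Mostly missing column": [np.nan] * 39 + [1.0],
})

mean_r2, (low, high) = evaluate_country(toy, drop_columns=["Mostly missing column"])
print(mean_r2, (low, high))
```

With such a helper, each country would need only one call with its own list of columns to drop.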

### Model Evaluation for Israel

In [224]:
world_bank_data["ISR"].info()


<class 'wbdata.client.DataFrame'>
Index: 63 entries, 2022 to 1960
Data columns (total 6 columns):
 #   Column                                                           Non-Null Count  Dtype  
---  ------                                                           --------------  -----  
 0   Adult female literacy (% aged 15 and older)                      1 non-null      float64
 1   Adult female workforce participation rate (% ages 15 and older)  32 non-null     float64
 2   Child mortality rate (per 1,000 live births)                     62 non-null     float64
 3   Gini index (World Bank estimate)                                 22 non-null     float64
 4   Life expectancy at birth (years)                                 58 non-null     float64
 5   GDP per capita (current US$)                                     38 non-null     float64
dtypes: float64(6)
memory usage: 3.4+ KB


This dataset has only 1 non-null value for female literacy. Thus, I will drop that column and perform the other tasks.

In [225]:
world_bank_data["ISR"].drop(columns=["Adult female literacy (% aged 15 and older)"],inplace=True)

In [228]:
Israel_new=world_bank_data["ISR"].dropna()

In [229]:
Israel_new.corr()

| | Adult female workforce participation rate (% ages 15 and older) | Child mortality rate (per 1,000 live births) | Gini index (World Bank estimate) | Life expectancy at birth (years) | GDP per capita (current US$) |
|---|---|---|---|---|---|
| Adult female workforce participation rate (% ages 15 and older) | 1.0 | -0.890512 | -0.256966 | 0.899628 | 0.941327 |
| Child mortality rate (per 1,000 live births) | -0.890512 | 1.0 | -0.070339 | -0.990759 | -0.905544 |
| Gini index (World Bank estimate) | -0.256966 | -0.070339 | 1.0 | 0.000185 | -0.27156 |
| Life expectancy at birth (years) | 0.899628 | -0.990759 | 0.000185 | 1.0 | 0.929013 |
| GDP per capita (current US$) | 0.941327 | -0.905544 | -0.27156 | 0.929013 | 1.0 |


This is surprising to see: life expectancy and child mortality have a correlation of -0.99, an almost perfect negative relationship. I will drop GDP and female workforce participation this time.

In [230]:
Israel_new1=Israel_new.drop(columns=["Adult female workforce participation rate (% ages 15 and older)", "GDP per capita (current US$)"] )

In [231]:
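# The score below was computed on Israel_new (all remaining columns); pass Israel_new1 to score the reduced feature set.
linear_regression_monte_carlo(Israel_new, N=1000)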

(0.9739051233031243, (0.9721862616247915, 0.9756239849814571))

### Model Evaluation for France

In [232]:
world_bank_data["FRA"].info()


<class 'wbdata.client.DataFrame'>
Index: 63 entries, 2022 to 1960
Data columns (total 6 columns):
 #   Column                                                           Non-Null Count  Dtype  
---  ------                                                           --------------  -----  
 0   Adult female literacy (% aged 15 and older)                      0 non-null      object 
 1   Adult female workforce participation rate (% ages 15 and older)  32 non-null     float64
 2   Child mortality rate (per 1,000 live births)                     62 non-null     float64
 3   Gini index (World Bank estimate)                                 30 non-null     float64
 4   Life expectancy at birth (years)                                 62 non-null     float64
 5   GDP per capita (current US$)                                     63 non-null     float64
dtypes: float64(5), object(1)
memory usage: 3.4+ KB


Once again, I will drop female literacy because I don't have any data for it. Then I will perform the same operations by calling the function I designed.

In [234]:
France_no_na=world_bank_data["FRA"].drop(columns=["Adult female literacy (% aged 15 and older)"])
France_no_na.corr()

| | Adult female workforce participation rate (% ages 15 and older) | Child mortality rate (per 1,000 live births) | Gini index (World Bank estimate) | Life expectancy at birth (years) | GDP per capita (current US$) |
|---|---|---|---|---|---|
| Adult female workforce participation rate (% ages 15 and older) | 1.0 | -0.787194 | 0.033874 | 0.936699 | 0.789432 |
| Child mortality rate (per 1,000 live births) | -0.787194 | 1.0 | 0.744426 | -0.936331 | -0.883921 |
| Gini index (World Bank estimate) | 0.033874 | 0.744426 | 1.0 | -0.597313 | -0.523738 |
| Life expectancy at birth (years) | 0.936699 | -0.936331 | -0.597313 | 1.0 | 0.979236 |
| GDP per capita (current US$) | 0.789432 | -0.883921 | -0.523738 | 0.979236 | 1.0 |


In [241]:
France_new=France_no_na.dropna()
France_new1=France_new.drop(columns=["Adult female workforce participation rate (% ages 15 and older)","GDP per capita (current US$)","Gini index (World Bank estimate)"])

In [242]:
France_new1

| date | Child mortality rate (per 1,000 live births) | Life expectancy at birth (years) |
|---|---|---|
| 2020 | 3.4 | 82.17561 |
| 2019 | 3.4 | 82.826829 |
| 2018 | 3.4 | 82.67561 |
| 2017 | 3.3 | 82.57561 |
| 2016 | 3.2 | 82.573171 |
| 2015 | 3.2 | 82.321951 |
| 2014 | 3.1 | 82.719512 |
| 2013 | 3.1 | 82.219512 |
| 2012 | 3.1 | 81.968293 |
| 2011 | 3.1 | 82.114634 |

In [244]:
linear_regression_monte_carlo(France_new1, N=1000)

(0.2886000338591023, (0.2075539068959415, 0.36964616082226315))

### 4) Models for Albania and Greece

### Albania

In [247]:
Albania=world_bank_data["ALB"]
Albania.isna().sum()

Adult female literacy (% aged 15 and older)                        58
Adult female workforce participation rate (% ages 15 and older)    31
Child mortality rate (per 1,000 live births)                       19
Gini index (World Bank estimate)                                   51
Life expectancy at birth (years)                                    1
GDP per capita (current US$)                                       24
dtype: int64

In [248]:
Albania_new=Albania.drop(columns=["Adult female literacy (% aged 15 and older)","Gini index (World Bank estimate)"])
Albania_new1=Albania_new.dropna()

In [255]:
Albania_new1.corr()

| | Adult female workforce participation rate (% ages 15 and older) | Child mortality rate (per 1,000 live births) | Life expectancy at birth (years) | GDP per capita (current US$) |
|---|---|---|---|---|
| Adult female workforce participation rate (% ages 15 and older) | 1.0 | 0.414611 | -0.426228 | -0.327359 |
| Child mortality rate (per 1,000 live births) | 0.414611 | 1.0 | -0.954404 | -0.961266 |
| Life expectancy at birth (years) | -0.426228 | -0.954404 | 1.0 | 0.88647 |
| GDP per capita (current US$) | -0.327359 | -0.961266 | 0.88647 | 1.0 |


In [249]:
linear_regression_monte_carlo(Albania_new1, N=1000)

(0.9245394366781928, (0.9207991462530593, 0.9282797271033264))

### Greece

In [250]:
Greece=world_bank_data["GRC"]
Greece.isna().sum()

Adult female literacy (% aged 15 and older)                        59
Adult female workforce participation rate (% ages 15 and older)    31
Child mortality rate (per 1,000 live births)                        1
Gini index (World Bank estimate)                                   43
Life expectancy at birth (years)                                    1
GDP per capita (current US$)                                        0
dtype: int64

In [280]:
Greece_new=Greece.drop(columns=["Adult female literacy (% aged 15 and older)"])
Greece_new1=Greece_new.dropna()

In [273]:
Greece_new1.corr()

| | Adult female workforce participation rate (% ages 15 and older) | Child mortality rate (per 1,000 live births) | Gini index (World Bank estimate) | Life expectancy at birth (years) | GDP per capita (current US$) |
|---|---|---|---|---|---|
| Adult female workforce participation rate (% ages 15 and older) | 1.0 | -0.944344 | -0.195992 | 0.835477 | 0.482236 |
| Child mortality rate (per 1,000 live births) | -0.944344 | 1.0 | 0.341382 | -0.74208 | -0.629064 |
| Gini index (World Bank estimate) | -0.195992 | 0.341382 | 1.0 | -0.072331 | -0.242747 |
| Life expectancy at birth (years) | 0.835477 | -0.74208 | -0.072331 | 1.0 | 0.149752 |
| GDP per capita (current US$) | 0.482236 | -0.629064 | -0.242747 | 0.149752 | 1.0 |


In [281]:
Greece_new1=Greece_new1.drop(columns=["Life expectancy at birth (years)","GDP per capita (current US$)","Gini index (World Bank estimate)"])

In [282]:
Greece_new1

| date | Adult female workforce participation rate (% ages 15 and older) | Child mortality rate (per 1,000 live births) |
|---|---|---|
| 2020 | 66.008 | 3.4 |
| 2019 | 67.039 | 3.5 |
| 2018 | 66.903 | 3.7 |
| 2017 | 67.078 | 3.8 |
| 2016 | 67.116 | 3.9 |
| 2015 | 67.003 | 3.8 |
| 2014 | 66.828 | 3.8 |
| 2013 | 66.943 | 3.6 |
| 2012 | 66.895 | 3.5 |
| 2011 | 66.628 | 3.5 |


In [285]:
linear_regression_monte_carlo(Greece_new1, N=1000)

(-0.16540172799785305, (-0.32812889141687873, -0.002674564578827393))

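This was an unusual case for me, because I did not expect a negative R^2. On held-out data, however, R^2 becomes negative whenever the model predicts worse than the mean of the test targets. I will try to comment on why this happened.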
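
A negative score is possible because `r2_score` compares the model's squared error with the squared error of always predicting the mean of the true values. The toy numbers below are made up for illustration and are not taken from the Greece data.

```python
from sklearn.metrics import r2_score

y_true = [1.0, 2.0, 3.0]
mean_prediction = [2.0, 2.0, 2.0]      # always predict the mean of y_true
reversed_prediction = [3.0, 2.0, 1.0]  # predictions with the wrong trend

# R^2 = 1 - SS_res / SS_tot, where SS_tot = 2.0 for these targets.
r2_mean = r2_score(y_true, mean_prediction)          # SS_res = 2.0 -> 0.0
r2_reversed = r2_score(y_true, reversed_prediction)  # SS_res = 8.0 -> -3.0
print(r2_mean, r2_reversed)
```

Predicting the mean gives exactly 0, and anything with a larger squared error than that baseline gives a negative value.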

## Q2

Get the following commodity price data from yahoo finance using the `yfinance` library:

- Silver (SI=F)
- Copper (HG=F)
- Platinum (PL=F)
- Gold (GC=F)
- Palladium (PA=F)

1. Write a linear regression model that relates the gold futures in terms of the other precious metals.
2. Analyze the regression results.
3. Does the model improve if we add interaction terms? Explain.
4. Now, do the same for each futures in the list above.

## Q3

Use the *Acoustic Extinguisher Fire Dataset* from Murat Köklü's [data server](https://www.muratkoklu.com/datasets/).

1. Explore the dataset, and project it to 2D space using PCA and LDA. Color the data points using the `STATUS` column.
2. Construct an SVM model to model the `STATUS` column and measure its quality using Accuracy, Precision, Recall, and F-1.
3. Construct a Logistic Regression model to model the `STATUS` column and measure its quality using Accuracy, Precision, Recall, and F-1.
4. Using the LR model, determine which variables affect the `STATUS` column the most.

## Q4

Use the hyperspectral image data (ROSIS sensor data over Pavia Italy) we used for Question 2 from HW1 for this question.

1. Load both the image data and the ground truth data. Reshape the image and name it `vectors`, and name the ground truth data `labels`.
2. Remove all data points whose label is 0.
3. Write a function that constructs a multi-label logistic regression model relating `vectors` to `labels`, and analyzes the accuracy using a correct statistical methodology. Analyze the accuracy results.
4. Now, run a model once over a single training and test set. Report the accuracy, precision, recall, and F1 per label basis. 
5. Repeat (3) and (4) for a multi-label SVM model.
6. Construct confusion matrices over a single run for both LR and SVM, and compare. Present your conclusions.