# Question 1

For this question use the World Bank Data for Turkey for the following indicators. Use [wbgapi](https://pypi.org/project/wbgapi/) for getting the data.

* [Literacy rate, adult female (SE.ADT.LITR.FE.ZS)](https://data.worldbank.org/indicator/SE.ADT.LITR.FE.ZS)
* [Labor force, female (SL.TLF.TOTL.FE.ZS)](https://data.worldbank.org/indicator/SL.TLF.TOTL.FE.ZS)
* [Poverty headcount ratio at national poverty lines (SI.POV.NAHC)](https://data.worldbank.org/indicator/SI.POV.NAHC)
* [Current health expenditure per capita (SH.XPD.CHEX.PC.CD)](https://data.worldbank.org/indicator/SH.XPD.CHEX.PC.CD)
* [GDP per capita (NY.GDP.PCAP.CD)](https://data.worldbank.org/indicator/NY.GDP.PCAP.CD)
* [Mortality rate, under-5 (SH.DYN.MORT)](https://data.worldbank.org/indicator/SH.DYN.MORT)


Using the [statsmodels](https://www.statsmodels.org/stable/index.html) library write the best linear regression model using child mortality as the dependent variable while the rest are considered as independent variables. Pay particular attention to the fact that the order of the variables put into the model significantly impacts the performance of the model. Choose the best model by considering

* with the minimum number of variables and their interactions,
* with the optimal ordering of the independent variables and their interactions,
* $R^2$-score of the model,
* statistical significance of the model coefficients,
* ANOVA analysis of the model.
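
The point about variable ordering can be illustrated on synthetic data. The sketch below is a minimal illustration, assuming NumPy only and two made-up correlated predictors `x1` and `x2` (not the World Bank indicators). It computes sequential (Type I) sums of squares by adding one column at a time. With correlated predictors, the sum of squares credited to each variable depends on the order in which it enters, while the residual sum of squares, and therefore $R^2$, is the same for both orderings. The helper name `sequential_ss` is hypothetical.

```python
import numpy as np

def sequential_ss(X, y):
    """Sequential sums of squares: drop in RSS as each column of X is added in order."""
    n = len(y)
    design = np.ones((n, 1))                      # start from the intercept-only model
    rss_prev = float(np.sum((y - y.mean()) ** 2))
    reductions = []
    for j in range(X.shape[1]):
        design = np.column_stack([design, X[:, j]])
        beta, *_ = np.linalg.lstsq(design, y, rcond=None)
        rss = float(np.sum((y - design @ beta) ** 2))
        reductions.append(rss_prev - rss)
        rss_prev = rss
    return reductions, rss_prev

# Synthetic, correlated predictors (assumption for illustration only)
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=200)
y = 1.0 + 2.0 * x1 + 1.0 * x2 + rng.normal(size=200)

ss_12, rss_12 = sequential_ss(np.column_stack([x1, x2]), y)   # x1 entered first
ss_21, rss_21 = sequential_ss(np.column_stack([x2, x1]), y)   # x2 entered first

print("x1 first:", ss_12, "RSS:", rss_12)
print("x2 first:", ss_21, "RSS:", rss_21)
```

The per-variable rows of a sequential ANOVA table therefore change with the formula order, even though the fitted model itself does not.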


In [1]:
import pandas as pd
import wbgapi as wb
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
from statsmodels.formula.api import logit
from statsmodels.api import Logit
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
import functools as ft
from sklearn.preprocessing import OneHotEncoder
from sklearn.neighbors import KNeighborsClassifier as knn
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.cluster import KMeans as kmeans
from scipy.spatial.distance import euclidean as d2
import statsmodels.formula.api as smf

In [4]:
lit = pd.DataFrame(list(wb.data.fetch('SE.ADT.LITR.FE.ZS')))
lab = pd.DataFrame(list(wb.data.fetch("SL.TLF.TOTL.FE.ZS")))
power = pd.DataFrame(list(wb.data.fetch("SI.POV.NAHC")))
heal = pd.DataFrame(list(wb.data.fetch("SH.XPD.CHEX.PC.CD")))
gdp = pd.DataFrame(list(wb.data.fetch("NY.GDP.PCAP.CD")))
mort = pd.DataFrame(list(wb.data.fetch("SH.DYN.MORT")))


In [7]:
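def extract(df,cntry,name):
    # Keep one economy's rows, index them by year, and name the value column
    tmp = df.loc[df['economy']==cntry, ['time','value']].copy()
    tmp.index = tmp.time
    del tmp['time']
    tmp.columns = [name]  # flat column name (a nested list would create a MultiIndex)
    return tmp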


In [9]:
def total(cntry):
    literacy = extract(lit,cntry,'literacy')
    labor = extract(lab,cntry,'labor')
    powerty = extract(power,cntry,'powerty')
    heall = extract(heal,cntry,'health')
    gdpp = extract(gdp,cntry,'gdp')
    mor = extract(mort,cntry,'mortality')
    res = literacy.join([labor,powerty,heall,gdpp,mor])
    res.dropna(inplace=True)
    return res


In [11]:
res = total('TUR')
X = res[["labor","powerty","health","gdp"]]
XX = sm.add_constant(X)
Y = res['mortality']
model = sm.OLS(Y,XX)
results = model.fit()
print(results.summary())

est = smf.ols(formula='mortality ~ gdp + powerty + labor + health', data=res).fit()
est.summary()

sm.stats.anova_lm(est)



                            OLS Regression Results                            
Dep. Variable:              mortality   R-squared:                       0.992
Model:                            OLS   Adj. R-squared:                  0.988
Method:                 Least Squares   F-statistic:                     242.2
Date:                Mon, 07 Nov 2022   Prob (F-statistic):           2.23e-08
Time:                        11:09:31   Log-Likelihood:                -7.8525
No. Observations:                  13   AIC:                             25.70
Df Residuals:                       8   BIC:                             28.53
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const           70.1838      7.670      9.151   



           df      sum_sq     mean_sq           F        PR(>F)
gdp       1.0  121.486101  121.486101  381.495275  4.906918e-08
powerty   1.0  124.230632  124.230632  390.113757  4.494799e-08
labor     1.0   60.348401   60.348401  189.508344  7.482327e-07
health    1.0    2.467288    2.467288    7.747872    0.02379571
Residual  8.0    2.547578    0.318447         NaN           NaN
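
The ANOVA table and the regression summary describe the same fit, which can be checked by hand. The sketch below is a worked calculation using only the sums of squares printed above; the variable names are illustrative. It recovers $R^2$ and the overall F-statistic, and shows the share of total variation credited to each term in this particular ordering.

```python
# Sequential sums of squares copied from the ANOVA table above
ss_terms = {"gdp": 121.486101, "powerty": 124.230632, "labor": 60.348401, "health": 2.467288}
ss_resid, df_resid = 2.547578, 8

ss_model = sum(ss_terms.values())
ss_total = ss_model + ss_resid

# R^2 is the explained share of the total sum of squares
r_squared = ss_model / ss_total

# Overall F: mean square of the model over mean square of the residuals
f_stat = (ss_model / len(ss_terms)) / (ss_resid / df_resid)

# Share of total variation credited to each term (depends on the order of entry)
shares = {name: ss / ss_total for name, ss in ss_terms.items()}

print(round(r_squared, 3), round(f_stat, 1))
print({name: round(share, 3) for name, share in shares.items()})
```

The rounded values should agree with the `R-squared` (0.992) and `F-statistic` (242.2) entries of the summary above.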


# Question 2

For this question use Yahoo's Finance API for the following tickers:

* Gold futures (GC=F)
* Silver futures (SI=F)
* Copper futures (HG=F)
* Platinum futures (PL=F)

1. Write the best linear regression model that explains gold futures closing prices in terms of opening prices of gold, silver, copper, and platinum futures.
2. Repeat the same for silver, copper and platinum prices.
3. Compare the models you obtained in Steps 1 and 2. Which model is better? How do you decide? Explain.

In [12]:
import yfinance as yf
from statsmodels.stats.anova import anova_lm

gold = yf.download('GC=F',start='2020-01-01')
silver = yf.download('SI=F',start='2020-01-01')
copper = yf.download('HG=F',start='2020-01-01')
platinum = yf.download('PL=F',start='2020-01-01')




In [13]:
tmp = {}
tmp['gold_open'] = gold['Open']
tmp['gold_close'] = gold['Close']
tmp['silver_open'] = silver['Open']
tmp['silver_close'] = silver['Close']
tmp['copper_open'] = copper['Open']
tmp['copper_close'] = copper['Close']
tmp['platinum_open'] = platinum['Open']
tmp['platinum_close'] = platinum['Close']
data = pd.DataFrame(tmp).dropna()
data

                             gold_open   gold_close  silver_open  silver_close  copper_open  copper_close  platinum_open  platinum_close
Date
2020-01-02 00:00:00-05:00  1518.099976  1524.500000    17.966000     17.966000       2.8165        2.8330     985.000000      978.599976
2020-01-03 00:00:00-05:00  1530.099976  1549.199951    18.110001     18.068001       2.7935        2.7985     986.500000      984.500000
2020-01-06 00:00:00-05:00  1580.000000  1566.199951    18.025000     18.097000       2.7780        2.8005     985.799988      960.400024
2020-01-07 00:00:00-05:00  1558.300049  1571.800049    18.014999     18.316000       2.8010        2.8040     965.299988      966.000000
2020-01-08 00:00:00-05:00  1579.699951  1557.400024    18.400000     18.087999       2.7870        2.8190     968.900024      959.000000
...
2022-11-01 00:00:00-04:00  1630.800049  1645.000000    19.125000     19.673000       3.4945        3.5095     959.799988      959.799988
2022-11-02 00:00:00-04:00  1650.800049  1645.699951    19.780001     19.600000       3.4985        3.5055     960.200012      960.200012
2022-11-03 00:00:00-04:00  1629.199951  1627.300049    19.235001     19.436001       3.4455        3.4565     933.400024      933.400024
2022-11-04 00:00:00-04:00  1630.199951  1672.500000    19.980000     20.790001       3.6370        3.7145     969.799988      969.799988


In [360]:
model_gold = smf.ols('gold_close ~ gold_open + silver_open + copper_open + platinum_open', data=data).fit()
model_gold2 = smf.ols('gold_close ~ gold_open + copper_open + platinum_open', data=data).fit()
print(model_gold.summary())
print(model_gold2.summary())
print(sm.stats.anova_lm(model_gold))
print(sm.stats.anova_lm(model_gold2))

                            OLS Regression Results                            
Dep. Variable:             gold_close   R-squared:                       0.972
Model:                            OLS   Adj. R-squared:                  0.972
Method:                 Least Squares   F-statistic:                     6137.
Date:                Thu, 03 Nov 2022   Prob (F-statistic):               0.00
Time:                        19:00:44   Log-Likelihood:                -3021.4
No. Observations:                 707   AIC:                             6053.
Df Residuals:                     702   BIC:                             6076.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept        29.3195     20.930      1.401
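
The AIC and BIC entries in the summary can be reproduced from the log-likelihood, which helps when comparing models for the same dependent variable (for example `model_gold` against `model_gold2`). The sketch below is a worked calculation, assuming the usual definitions $\mathrm{AIC} = 2k - 2\ln L$ and $\mathrm{BIC} = k\ln n - 2\ln L$, with $k$ counting the intercept plus the four slopes. The helper name `aic_bic` is hypothetical.

```python
import math

def aic_bic(log_likelihood, n_obs, n_params):
    """Information criteria from a log-likelihood; lower is better for the same response."""
    aic = 2 * n_params - 2 * log_likelihood
    bic = n_params * math.log(n_obs) - 2 * log_likelihood
    return aic, bic

# Values copied from the gold summary above: Log-Likelihood, No. Observations, Df Model + 1
aic, bic = aic_bic(log_likelihood=-3021.4, n_obs=707, n_params=5)
print(round(aic), round(bic))
```

Because the likelihood depends on the scale of the response, these criteria are not comparable between, say, the gold and copper models.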

In [355]:
model_silver = smf.ols('silver_close ~ gold_open + copper_open + platinum_open', data=data).fit()
model_silver2 = smf.ols('silver_close ~ copper_open + platinum_open', data=data).fit()
print(model_silver.summary())
print(model_silver2.summary())
print(sm.stats.anova_lm(model_silver))
print(sm.stats.anova_lm(model_silver2))

                            OLS Regression Results                            
Dep. Variable:           silver_close   R-squared:                       0.891
Model:                            OLS   Adj. R-squared:                  0.890
Method:                 Least Squares   F-statistic:                     1909.
Date:                Thu, 03 Nov 2022   Prob (F-statistic):               0.00
Time:                        18:59:02   Log-Likelihood:                -1142.6
No. Observations:                 707   AIC:                             2293.
Df Residuals:                     703   BIC:                             2311.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept       -31.6528      0.855    -37.036

In [356]:
model_copper = smf.ols('copper_close ~ gold_open + silver_open + platinum_open', data=data).fit()
model_copper2 = smf.ols('copper_close ~ silver_open + platinum_open', data=data).fit()
print(model_copper.summary())
print(model_copper2.summary())
print(sm.stats.anova_lm(model_copper))
print(sm.stats.anova_lm(model_copper2))

                            OLS Regression Results                            
Dep. Variable:           copper_close   R-squared:                       0.507
Model:                            OLS   Adj. R-squared:                  0.505
Method:                 Least Squares   F-statistic:                     241.3
Date:                Thu, 03 Nov 2022   Prob (F-statistic):          1.30e-107
Time:                        18:59:02   Log-Likelihood:                -560.09
No. Observations:                 707   AIC:                             1128.
Df Residuals:                     703   BIC:                             1146.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept        -3.2784      0.632     -5.187

In [357]:
model_platinum = smf.ols('platinum_close ~ gold_open + silver_open + copper_open', data=data).fit()
model_platinum2 = smf.ols('platinum_close ~ copper_open + silver_open', data=data).fit()
print(model_platinum.summary())
print(model_platinum2.summary())
print(sm.stats.anova_lm(model_platinum))
print(sm.stats.anova_lm(model_platinum2))


                            OLS Regression Results                            
Dep. Variable:         platinum_close   R-squared:                       0.803
Model:                            OLS   Adj. R-squared:                  0.803
Method:                 Least Squares   F-statistic:                     957.3
Date:                Thu, 03 Nov 2022   Prob (F-statistic):          1.02e-247
Time:                        18:59:03   Log-Likelihood:                -3824.9
No. Observations:                 707   AIC:                             7658.
Df Residuals:                     703   BIC:                             7676.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept    1342.7751     41.354     32.471      

In [368]:
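# Note: anova_lm expects nested models ordered from the restricted model to the larger one.
# Here the larger model is passed first, which is why df_diff is negative and Pr(>F) is NaN below.
print(anova_lm(model_gold, model_gold2),"\n")
print(anova_lm(model_silver, model_silver2),"\n")
print(anova_lm(model_copper, model_copper2),"\n")
print(anova_lm(model_platinum, model_platinum2),"\n")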

   df_resid            ssr  df_diff    ss_diff         F  Pr(>F)
0     702.0  213259.699276      0.0        NaN       NaN     NaN
1     703.0  213359.112520     -1.0 -99.413245  0.327558     NaN 

   df_resid          ssr  df_diff      ss_diff           F  Pr(>F)
0     703.0  1048.789743      0.0          NaN         NaN     NaN
1     704.0  3617.643160     -1.0 -2568.853417  499.903591     NaN 

   df_resid         ssr  df_diff   ss_diff          F  Pr(>F)
0     703.0  201.856111      0.0       NaN        NaN     NaN
1     704.0  208.298657     -1.0 -6.442546  21.774277     NaN 

   df_resid           ssr  df_diff       ss_diff           F  Pr(>F)
0     703.0  2.070337e+06      0.0           NaN         NaN     NaN
1     704.0  3.857883e+06     -1.0 -1.787545e+06  326.197573     NaN 
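
The `Pr(>F)` column is empty in the tables above because `df_diff` is negative, so the p-values have to be obtained another way. The sketch below is a hand computation of the standard nested-model F-test, assuming SciPy is available and using only the `ssr` and `df_resid` values printed above. The helper name `nested_f_test` is hypothetical, and the resulting F values differ from the `F` column above, which was scaled by the smaller model's residual variance.

```python
from scipy import stats

def nested_f_test(ssr_restricted, df_restricted, ssr_full, df_full):
    """F-test of a restricted model against the larger model that contains it."""
    df_diff = df_restricted - df_full
    f_stat = ((ssr_restricted - ssr_full) / df_diff) / (ssr_full / df_full)
    p_value = stats.f.sf(f_stat, df_diff, df_full)
    return f_stat, p_value

# Gold: dropping silver_open from the full model
f_gold, p_gold = nested_f_test(213359.112520, 703, 213259.699276, 702)
# Silver: dropping gold_open from the three-variable model
f_silver, p_silver = nested_f_test(3617.643160, 704, 1048.789743, 703)

print("gold:  ", f_gold, p_gold)
print("silver:", f_silver, p_silver)
```

A large p-value suggests the dropped variable adds little once the others are in the model; a very small one suggests the larger model should be kept.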



The models can be compared using their R-squared values, which measure the proportion of the variance in the dependent variable that the model explains. R-squared lies between 0 and 1, and a larger value indicates a closer fit to the observed data. The gold model from Step 1 has an R-squared of 0.972, while the Step 2 models have 0.891 (silver), 0.803 (platinum), and 0.507 (copper). By this criterion, the gold model from Step 1 fits the data best.
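
Since the compared models do not all use the same number of predictors, adjusted R-squared is a natural companion to the raw value. The sketch below is a worked calculation from the standard formula, using the rounded R-squared values and observation counts printed in the summaries above. The helper name `adjusted_r2` is hypothetical.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared: penalizes R-squared for the number of predictors used."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Rounded inputs copied from the summaries above (707 observations each)
gold_adj = adjusted_r2(0.972, n_obs=707, n_predictors=4)
copper_adj = adjusted_r2(0.507, n_obs=707, n_predictors=3)

print(round(gold_adj, 3), round(copper_adj, 3))
```

With 707 observations and only three or four predictors, the adjustment is small, so the ranking by R-squared is unchanged here.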

# Question 3

1. Write a function that takes a ticker symbol and returns a pandas dataframe that for each day puts a 1 when the closing price is higher than the opening price, a 0 when the closing price is lower than the opening price.
2. Write the best logistic regression that predicts the time series you obtain from Step 1 for gold futures against the opening prices of gold, silver, copper, and platinum prices.
3. Repeat the same for silver, copper, and platinum prices.
4. Compare the models you obtained from Steps 2 and 3. Decide which is the best model, and explain your reasoning.
5. Does any of the models provide a good fit? Explain.
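
One way to approach Step 5 is to compare a classifier's accuracy with the accuracy of always predicting the more frequent class. The sketch below is a minimal illustration, assuming NumPy and a small made-up label vector (not the futures data). The helper names `majority_baseline` and `accuracy` are hypothetical.

```python
import numpy as np

def majority_baseline(y):
    """Accuracy obtained by always predicting the most frequent class (labels 0/1)."""
    counts = np.bincount(np.asarray(y), minlength=2)
    return counts.max() / counts.sum()

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

# Made-up up/down labels and predictions, for illustration only
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 1])
y_pred = np.array([1, 1, 1, 1, 0, 1, 1, 1])

baseline = majority_baseline(y_true)
acc = accuracy(y_true, y_pred)
print(baseline, acc)
```

A model whose test accuracy is not clearly above this baseline has learned little about the direction of the daily move, however its coefficients look.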

In [14]:
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from zipfile import ZipFile
from io import BytesIO
from urllib.request import urlopen
from collections import Counter
from sklearn.metrics import confusion_matrix

from statsmodels.formula.api import logit
from statsmodels.api import Logit

from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.tree import DecisionTreeRegressor

In [3]:
import yfinance as yf


def check(ticker_name):
    """Return the ticker's price history with Result = 1 when Close > Open, else 0."""
    ticker = yf.Ticker(ticker_name)
    data = ticker.history()
    data["Result"] = (data["Close"] > data["Open"]).astype(int)
    return data

print(check("AAPL"))



                                 Open        High         Low       Close  \
Date                                                                        
2022-10-05 00:00:00-04:00  143.831404  147.135920  142.773147  146.157532   
2022-10-06 00:00:00-04:00  145.568524  147.295654  144.979504  145.189148   
2022-10-07 00:00:00-04:00  142.303926  142.863011  139.219047  139.857986   
2022-10-10 00:00:00-04:00  140.187439  141.655006  138.340512  140.187439   
2022-10-11 00:00:00-04:00  139.668307  141.115918  137.991096  138.749832   
2022-10-12 00:00:00-04:00  138.899586  140.127544  137.931191  138.110886   
2022-10-13 00:00:00-04:00  134.766453  143.352202  134.147469  142.753204   
2022-10-14 00:00:00-04:00  144.071005  144.280664  137.961145  138.150833   
2022-10-17 00:00:00-04:00  140.836387  142.663343  140.037708  142.174164   
2022-10-18 00:00:00-04:00  145.249056  146.457044  140.377133  143.511932   
2022-10-19 00:00:00-04:00  141.455346  144.709941  141.265658  143.621750   

In [4]:
import yfinance as yf

gold = yf.download('GC=F',start='2020-01-01')
silver = yf.download('SI=F',start='2020-01-01')
copper = yf.download('HG=F',start='2020-01-01')
platinum = yf.download('PL=F',start='2020-01-01')


tmp = {}
tmp['gold_open'] = gold['Open']
tmp['gold_close'] = gold['Close']
tmp['silver_open'] = silver['Open']
tmp['silver_close'] = silver['Close']
tmp['copper_open'] = copper['Open']
tmp['copper_close'] = copper['Close']
tmp['platinum_open'] = platinum['Open']
tmp['platinum_close'] = platinum['Close']
data = pd.DataFrame(tmp).dropna()
data



                             gold_open   gold_close  silver_open  silver_close  copper_open  copper_close  platinum_open  platinum_close
Date
2020-01-02 00:00:00-05:00  1518.099976  1524.500000    17.966000     17.966000       2.8165        2.8330     985.000000      978.599976
2020-01-03 00:00:00-05:00  1530.099976  1549.199951    18.110001     18.068001       2.7935        2.7985     986.500000      984.500000
2020-01-06 00:00:00-05:00  1580.000000  1566.199951    18.025000     18.097000       2.7780        2.8005     985.799988      960.400024
2020-01-07 00:00:00-05:00  1558.300049  1571.800049    18.014999     18.316000       2.8010        2.8040     965.299988      966.000000
2020-01-08 00:00:00-05:00  1579.699951  1557.400024    18.400000     18.087999       2.7870        2.8190     968.900024      959.000000
...
2022-11-01 00:00:00-04:00  1630.800049  1645.000000    19.125000     19.673000       3.4945        3.5095     959.799988      959.799988
2022-11-02 00:00:00-04:00  1650.800049  1645.699951    19.780001     19.600000       3.4985        3.5055     960.200012      960.200012
2022-11-03 00:00:00-04:00  1629.199951  1627.300049    19.235001     19.436001       3.4455        3.4565     933.400024      933.400024
2022-11-04 00:00:00-04:00  1630.199951  1672.500000    19.980000     20.790001       3.6370        3.7145     969.799988      969.799988


In [1]:

# Binary target: 1 when gold closes above its opening price, 0 otherwise
data['gold_up'] = (data['gold_close'] > data['gold_open']).astype(int)

model = logit('gold_up ~ gold_open + silver_open + copper_open + platinum_open', data=data).fit()
print(model.summary())

# Same predictors with scikit-learn, scored on a held-out test set
X = data[['gold_open','silver_open','copper_open','platinum_open']]
y = data['gold_up']
X_train, X_test, y_train, y_test = train_test_split(X,y,train_size=0.75)

model = LogisticRegression(max_iter=1000)
model.fit(X_train,y_train)
print(model.score(X_test,y_test))
y_predict = model.predict(X_test)

# Question 4

For this question use the following [data](https://archive.ics.uci.edu/ml/datasets/credit+approval):


1. Split the data into training and test set.
2. Write different logistic regression models predicting y against X.
3. Construct [confusion matrices](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) on the test data set for these different models.
4. Analyze these models. Explain which model is the best model you have found.
5. Repeat Steps 1-4 several times. Does your best model stay as the best model? What should be the correct protocol to decide on the best model explaining the data?
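
For the protocol question in Step 5, one common answer is to score every candidate model on the same set of cross-validation folds instead of on a single random split. The sketch below is a minimal illustration, assuming scikit-learn and a synthetic dataset from `make_classification` (not the credit approval data). The two candidate models and the names `X_demo`, `y_demo`, and `candidates` are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (assumption for illustration only)
X_demo, y_demo = make_classification(n_samples=300, n_features=5, n_informative=3, random_state=0)

# Fixed folds so every candidate is evaluated on exactly the same splits
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

candidates = {
    "default C": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "strong regularization": make_pipeline(StandardScaler(), LogisticRegression(C=0.01, max_iter=1000)),
}

cv_scores = {name: cross_val_score(est, X_demo, y_demo, cv=cv) for name, est in candidates.items()}
for name, scores in cv_scores.items():
    print(name, round(scores.mean(), 3), round(scores.std(), 3))
```

Reporting the mean and spread over folds makes it easier to tell a real difference between models from the noise of one lucky split.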

In [17]:
credit = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.data', header=None)

fn = {'+': 1, '-': 0}

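# '?' marks missing values; replace them with 0 and convert the selected columns to numeric
X = credit.replace('?',0).iloc[:,[1,2,7,10,14]].astype(float)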
y = credit.iloc[:,15].map(lambda x: fn.get(x,0))
credit

     0      1       2  3  4   5   6     7  8  9  10  11  12     13   14  15
0    b  30.83   0.000  u  g   w   v  1.25  t  t   1   f   g  00202    0   +
1    a  58.67   4.460  u  g   q   h  3.04  t  t   6   f   g  00043  560   +
2    a  24.50   0.500  u  g   q   h  1.50  t  f   0   f   g  00280  824   +
3    b  27.83   1.540  u  g   w   v  3.75  t  t   5   t   g  00100    3   +
4    b  20.17   5.625  u  g   w   v  1.71  t  f   0   f   s  00120    0   +
...
685  b  21.08  10.085  y  p   e   h  1.25  f  f   0   f   g  00260    0   -
686  a  22.67   0.750  u  g   c   v  2.00  f  t   2   t   g  00200  394   -
687  a  25.25  13.500  y  p  ff  ff  2.00  f  t   1   t   g  00200    1   -
688  b  17.92   0.205  u  g  aa   v  0.04  f  f   0   f   g  00280  750   -


In [15]:
def bootstrap(X,y,model):
    res = []
    for i in range(100):
        X_train, X_test, y_train, y_test = train_test_split(X,y,train_size=0.75)
        model.fit(X_train,y_train)
        res.append(model.score(X_test,y_test))
    tmp = sorted(res)[3:97]
    return (min(tmp),max(tmp))
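
The function above repeats random train/test splits and trims the three lowest and three highest scores, which gives roughly a 94% range of test accuracies. The sketch below is a minimal alternative for the trimming step, assuming NumPy and a made-up score array. The helper name `percentile_interval` is hypothetical; it makes the coverage explicit instead of hard-coding slice indices.

```python
import numpy as np

def percentile_interval(scores, coverage=0.95):
    """Central interval that contains the given share of the scores."""
    alpha = (1 - coverage) / 2
    lo, hi = np.percentile(scores, [100 * alpha, 100 * (1 - alpha)])
    return float(lo), float(hi)

# Made-up scores 0.00, 0.01, ..., 1.00 to make the result easy to check
demo_scores = np.arange(101) / 100
lo, hi = percentile_interval(demo_scores)
print(lo, hi)
```

With the evenly spaced demo scores, the interval runs from about 0.025 to 0.975, as expected for 95% coverage.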


In [18]:

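# Patsy formulas need identifier-like names, so rename the integer columns first
df = X.rename(columns={1:'A2', 2:'A3', 7:'A8', 10:'A11', 14:'A15'}).astype(float).assign(approved=y)
model = logit('approved ~ A2 + A3 + A8 + A11', data=df).fit()
# bootstrap() calls fit(X, y) and score(X, y), so pass a scikit-learn estimator
print(bootstrap(df.drop(columns='approved'), y, LogisticRegression(max_iter=1000)))
model.summary()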