# <font color = "slateblue"> EXAM I </font>
#### <font color = "slateblue"> Math and Stats</font>
#### <font color = "slateblue"> Date: 11th, Feb 2021 </font>

---



## <font color = "limegreen">Student Data</font>

Fill your **NAME** only:

#### Student Name: ..........Bea....................

---


## <font color = "limegreen">Instructions</font>

Read these instructions carefully and follow them during your exam and in your submission.

 * The exam lasts **1 week**
 * Read the questions carefully and do not answer before knowing what is asked
 * Full marks require **full explanations**. Just answering the question is not enough: for example, if the answer is that the type of data is *panel data*, stating only that will earn no more than 25% of the available points.
 * The **answers** must be written right below the questions made in this notebook. Use Code and Text cells as needed


---

## <font color = "limegreen"> Packages </font>

In the next code cell, add **ALL** the modules you will use in your exam: `numpy`, `pandas`,...

In [None]:
# this extension helps improve code readability
%load_ext nb_black
###
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm

plt.style.use("seaborn")

In case you work with Colab notebooks, use the following cell to connect to your drive

In [None]:
# not necessary because we won't be using Colab

---

## <font color = "limegreen">The Data </font>

**A study about the status of scientific research around the world contains data from countries in Europe and in America in the year 2016. It contains the following variables:**

<br>

| Variable    |           Description           |
|-------------|---------------------------------|
|  country    | country name |
|  continent  | continent to which the country belongs |
|  RDExpen    | Research and development expenditure (% of GDP)|
|  STJpapers  | Scientific and technical journal articles |
| researchers | Researchers in R&D (per million people) | 
<br>

**In order to load it, use the next *code cell*, taking into account that it is an Excel file, not a .csv. In this same code cell, add an instruction that shows the first 5 rows of the dataset.**

In [None]:
# read the file and show the first rows to check everything is ok
science_data = pd.read_excel("science_indicators.xlsx")
science_data.head()


---

## Questions

## 1.- <font color = "Red"> The Data Set </font>

**In this first part, let's briefly describe the dataset**

### 1.1.- <font color = "Blue">Data (1 Point)</font>

**From the statistics point of view, what type of data can you find in the study?**

The data supplied is *cross-sectional*: science-related **information was collected from** several **different individuals** (countries) **during the same period of time**, the year 2016. Randomness applies in the sense that some data may be missing, even for a country for which we do have other scientific information.

### 1.2.- <font color = "Blue">Variables (1 Point)</font>

**Describe the type of variables you can find in the study both, from the point of view of their nature and from the point of view of their role (again from the statistics point of view)**

There are 5 variables in the data set.<br>
<br>
From the <b>nature</b> point of view:<br>
<ul><li> <i>Country Name</i> - categorical, nominal</li>
    <li> <i>Continent</i> - categorical, nominal</li>
    <li> <i>RDExpen</i> - numerical, continuous (as it measures the ratio between the expenditure and the GDP)</li>
    <li> <i>STJpapers</i> - numerical, apparently continuous (it is a little strange to count papers in decimals)</li>
    <li> <i>researchers</i> - numerical, continuous (as it is presented as a ratio between population and scientists)</li></ul>
<br>
Regarding their <b>roles</b>:<br>
<ul><li> <i>Country Name</i> - confounding. Science expense would be enough to analyze our information, but we cannot disregard which country we are talking about (same size countries may have different interests in science and, moreover, the expense variable is presented as a percentage of each country's GDP).</li>
    <li> <i>Continent</i> - confounding but dependent on <i>Country Name</i> (every country belongs to a continent and it cannot be changed (at least not that easily))</li>
    <li> <i>RDExpen</i> - explanatory but dependent on <i>Country Name</i> (every country has a science budget as a percentage of their GDP that may differ depending on their interest in science)</li>
    <li> <i>STJpapers</i> - response, dependent on <i>RDExpen</i> and <i>researchers</i> (the more money a country invests in science, the more researchers can be hired and therefore, more scientific papers produced)</li>
    <li> <i>researchers</i> - can be both explanatory and response, depending on our point of view. If we take it as explanatory and therefore independent, we might argue that the more researchers a country hires, the more papers they will produce. But we can also consider it a response variable, dependent on the amount of R+D expense: the more money available the more researchers will be hired.</li></ul>


### 1.3.- <font color = "Blue">Population and Sample (1 Point)</font>

**Explain which is the population of interest and which are the statistical units**

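The population of interest is the set of all European and American countries, considered with respect to their scientific research activity in 2016. The statistical units are the individual countries, and the sample consists of those European and American countries for which we have information.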

### 1.4.- <font color = "Blue">Sample Size (1 Point)</font>

**Determine the sample size using Python. Then determine how many countries belong to Europe and how many to America in this sample**

In [None]:
print(
    "The sample size is",
    science_data.shape[0],
    "countries for which we have science-related information.",
)

In [None]:
continents = (
    science_data.loc[:, ["Country Name", "Continent"]].groupby(by="Continent").count()
)
print(
    "There are",
    continents.at["america", "Country Name"],
    "American countries and",
    continents.at["europe", "Country Name"],
    "European countries in our sample data.",
)
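
As a cross-check of the counting logic, the same per-category totals can be obtained with `value_counts()`. This is a minimal sketch on a hypothetical toy table (the countries "A" to "E" are invented); only the column names are borrowed from the cell above.

```python
import pandas as pd

# Hypothetical toy data; the column names mirror those used in the exam cell.
toy = pd.DataFrame(
    {
        "Country Name": ["A", "B", "C", "D", "E"],
        "Continent": ["europe", "europe", "america", "europe", "america"],
    }
)

# value_counts() returns the number of rows per category in a single step
per_continent = toy["Continent"].value_counts()

print("Sample size:", len(toy))
print(per_continent)
```

`value_counts()` ignores missing values by default, so it counts only the rows where the continent is known.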

### 1.5.- <font color = "Blue">Sizes (1 Point)</font>

**If you split the `RDExpen` and `STJpapers` variables in three levels denoted by `low`, `mid` and `high` (using the own range of the variables), determine the number of countries in our sample that are in each of the levels generated by this division**

In [None]:
# first check if we have RDExpen data for all the countries
science_data["RDExpen"].count() == science_data.shape[0]

In [None]:
# next we check if we have STJpapers data for all the countries
science_data["STJpapers"].count() == science_data.shape[0]

In [None]:
# we do for STJpapers but not for RDExpen, so we must take that into account
# by adding empty values also in our categorical variables

# we are doing the same process twice, so we are going to create a for loop
vars2cat = ["RDExpen", "STJpapers"]

for var in vars2cat:
    # first define the limits for the low, mid and high values. we choose
    # the Q1 and Q3 quartiles.
    low_limit = science_data[var].quantile(0.25)
    mid_limit = science_data[var].quantile(0.75)

    # then we add the new categorical variable
    science_data[var + "_cat"] = np.where(
        np.isnan(science_data[var]),
        None,
        (
            np.where(
                science_data[var] <= low_limit,
                "low",
                (np.where(science_data[var] <= mid_limit, "mid", "high")),
            )
        ),
    )

    # display(science_data[var + "_cat"].value_counts())
    print(
        "There are",
        science_data[var + "_cat"].value_counts()["low"],
        "countries in the low level,",
        science_data[var + "_cat"].value_counts()["mid"],
        "countries in the mid level and",
        science_data[var + "_cat"].value_counts()["high"],
        "countries in the high level of",
        var,
    )
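
The question mentions "the own range of the variables", which can also be read as three intervals of equal width between the minimum and the maximum, rather than the quartile limits chosen above. A minimal sketch of that alternative reading, assuming `pd.cut` with `bins=3` and hypothetical values that are not taken from the exam data:

```python
import pandas as pd

# Hypothetical values, not taken from the exam data set
values = pd.Series([0.1, 0.4, 0.5, 0.9, 1.2, 2.0, 3.0])

# bins=3 splits the interval [min, max] into three intervals of equal width
levels = pd.cut(values, bins=3, labels=["low", "mid", "high"])

# count the observations per level, in a fixed order
counts = levels.value_counts().reindex(["low", "mid", "high"])
print(counts)
```

With this approach missing values stay missing in the resulting categorical variable, so no extra handling is needed for them.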

## 2.- <font color = "Red"> EDA </font>

**The next questions are for the variables in the dataset**

### 2.1.- <font color = "Blue"> Graphical Analysis (2 Points) </font> 

**Plot side by side its histogram and the boxplot. Then answer the following questions:**

 * **Is the distribution symmetric? Which value of skewness would you expect?**
 * **Do you detect any outliers? Which value of the excess kurtosis would you expect?**
 * **Which central tendency and variability measures would you use to describe the distribution?**

In [None]:
# we are going to repeat the same plots for the three variables, so we better use a for loop
dict_plots = {
    "RDExpen": {"plot_order": 0, "main_color": "green", "sec_color": "lightgreen"},
    "STJpapers": {"plot_order": 1, "main_color": "red", "sec_color": "lightcoral"},
    "researchers": {"plot_order": 2, "main_color": "blue", "sec_color": "lightblue"},
}

# first we configure the main plot
plt.figure(figsize=(16, 18))

# then we loop to create two plots for each variable
for dict_plot in dict_plots:

    # then we define the first subplot with the histogram, and if there were empty values, we discard them
    plt.subplot2grid((3, 2), (dict_plots[dict_plot]["plot_order"], 0))
    plt.hist(
        science_data[dict_plot].dropna(),
        ec=dict_plots[dict_plot]["main_color"],
        color=dict_plots[dict_plot]["sec_color"],
        bins="rice",
    )
    plt.title(dict_plot + " Histogram", fontsize=15)
    plt.xlabel("Values", fontsize=12)
    plt.ylabel("Frequency", fontsize=12)

    # to create the boxplot we must discard empty values again if any
    plt.subplot2grid((3, 2), (dict_plots[dict_plot]["plot_order"], 1))
    plt.boxplot(
        science_data[dict_plot].dropna(),
        widths=0.6,
        patch_artist=True,
        showmeans=True,
        whis=1.5,
        labels=[dict_plot],
        boxprops=dict(facecolor=dict_plots[dict_plot]["sec_color"]),
        flierprops=dict(
            marker="o", markerfacecolor=dict_plots[dict_plot]["main_color"]
        ),
    )
    plt.title(dict_plot + " Boxplot", fontsize=15)
    plt.ylabel("Values", fontsize=12)

# last, we print everything on screen
plt.show()

From now on we will describe the <i>RDExpen</i> distribution.

It is not symmetric but right-skewed, which means most scientific budgets in Europe and America are tight, while a smaller number of countries invest heavily in science and stretch the right tail. We'd expect a skewness value greater than 0. Let's calculate it:

In [None]:
ske = science_data.RDExpen.skew()
ske

There are no outliers in the plot, and the weight of the tails seems pretty small, especially on one side, so we would say the distribution is platykurtic and should expect a negative excess kurtosis value. Let's check it:

In [None]:
kur = science_data.RDExpen.kurt()
kur

Since it is not a symmetric distribution, we discard the mean as a proper central tendency measure, and instead use the median to better describe it. From the boxplot we can tell it should be around 1.4. Let's calculate it:

In [None]:
med = science_data.RDExpen.median()
med

Following our comment just above, we use IQR as the variability measure to describe our distribution. From the boxplot we can tell it should be around 1.6 (a little above 2.0 minus a little below 0.5):

In [None]:
IQR = science_data["RDExpen"].quantile(0.75) - science_data["RDExpen"].quantile(0.25)
IQR

### 2.2.- <font color = "Blue">Quantitative Analysis (2 Points)</font>

**Make a summary with ALL the numerical quantities needed to describe the distribution. Then interpret them with respect to your arguments in 2.1. Did your expectations match with the numerical results? Explain.**

In [None]:
print("{:^30}".format("CENTRAL TENDENCY"))
print("-" * 30)
print("{:<25}".format("Mean:"), np.round(science_data.RDExpen.mean(), 2))
print("{:<25}".format("Median:"), np.round(med, 2))
print("{:<2}".format("Mode:"), [i for i in science_data.RDExpen.mode()])
print("\n{:^30}".format("VARIABILITY"))
print("-" * 30)
print("{:<25}".format("St. Dev.:"), np.round(science_data.RDExpen.std(), 2))
print("{:<25}".format("Q1:"), np.round(science_data.RDExpen.quantile(0.25), 2))
print("{:<25}".format("Q3:"), np.round(science_data.RDExpen.quantile(0.75), 2))
print("{:<25}".format("IQR:"), np.round(IQR, 2))
print("{:<22}".format("Minimum:"), science_data.RDExpen.min())
print("{:<22}".format("Maximum:"), science_data.RDExpen.max())
print("\n{:^30}".format("SHAPE"))
print("-" * 30)
print("{:<25}".format("Skewness:"), np.round(ske, 2))
print("{:<24}".format("Kurtosis:"), np.round(kur, 2))

The median, IQR, skewness and kurtosis values match our expectations, as we showed in question 2.1. It is notable that there is no single mode value, but if we take into account that this is a continuous numerical variable it is not surprising.
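
The sign conventions behind these shape measures can be sanity-checked on synthetic samples. The sketch below assumes an exponential sample as a stand-in for a right-skewed variable and a uniform sample as a stand-in for a light-tailed one; neither is the exam data. pandas' `kurt()` reports excess kurtosis (0 for a normal distribution).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic samples: right-skewed (exponential) and flat, light-tailed (uniform)
right_skewed = pd.Series(rng.exponential(scale=1.0, size=5000))
flat = pd.Series(rng.uniform(0.0, 1.0, size=5000))

# a right-skewed sample has positive skewness and its mean above its median
print("exponential skewness:", round(right_skewed.skew(), 2))
print("mean > median:", right_skewed.mean() > right_skewed.median())

# a platykurtic sample has negative excess kurtosis
print("uniform excess kurtosis:", round(flat.kurt(), 2))
```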

### 2.3.- <font color = "Blue"> Dependency (2 Points)</font>

**Analyze from the point of view of association, correlation and relationship the dependency of the Scientific and technical journal articles, variable `STJpapers` (independent) with the Research and development expenditure, variable `RDExpen` (response). Make the proper graph for this analysis.**

In [None]:
# As we stated in question 1.2, variable "STJpapers" is for us the response variable and "RDExpen" is independent,
# therefore we will make the analysis following that assumption (probably wrong, but at least we will be consistent).

print("In order to find association, we need to calculate the Covariance.")
display(science_data[["RDExpen", "STJpapers"]].cov())
print(
    "The value of sxy",
    "({:.2f})".format(
        science_data[["RDExpen", "STJpapers"]].cov().at["RDExpen", "STJpapers"]
    ),
    "is different than zero, so we can say there is some degree of linear association between the variables.",
)

In [None]:
print(
    "Now we measure the strength of the linear correlation with the square of Pearson's coefficient (r squared)."
)
display(science_data[["RDExpen", "STJpapers"]].corr() ** 2)
print(
    "We can say there is a weak correlation",
    "(r squared = {:.2f})".format(
        (science_data[["RDExpen", "STJpapers"]].corr() ** 2).at["RDExpen", "STJpapers"]
    ),
)

In [None]:
print(
    "To talk about relationship, we draw a scatterplot and try to identify a function to describe the association."
)
sns.regplot(x="RDExpen", y="STJpapers", data=science_data)
plt.xlabel("RDExpen")
plt.ylabel("STJpapers")
plt.title("Scatter Plot", fontsize=20)

plt.show()

Looking with benevolent eyes, we can see a linear association (we already expected it to be weak).
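
The three quantities involved in this analysis (covariance, Pearson's r and its square) are related in a simple way, which the following sketch illustrates on hypothetical paired observations. The column names `x` and `y` and the values are invented for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical paired observations
df = pd.DataFrame(
    {"x": [1.0, 2.0, 3.0, 4.0, 5.0], "y": [2.0, 1.0, 4.0, 3.0, 6.0]}
)

# sample covariance: its sign gives the direction, its size depends on the units
s_xy = df["x"].cov(df["y"])

# Pearson's r: covariance rescaled by both standard deviations, always in [-1, 1]
r = df["x"].corr(df["y"])
r_from_cov = s_xy / (df["x"].std() * df["y"].std())

# r squared: share of the variance explained by a straight-line fit
r_squared = r**2

print(f"s_xy = {s_xy:.2f}, r = {r:.3f}, r^2 = {r_squared:.3f}")
```

Because r squared is never larger than |r|, a correlation judged on r squared looks weaker than the same correlation judged on r, so it helps to state which of the two is being reported.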

### 2.4.- <font color = "Blue"> Categoricals (2 Points)</font>

**Using the split of question 1.5, make a graph that shows the relative frequencies of each category. Discuss it.**

In [None]:
# we first create a contingency table with the levels in a meaningful order
levels = ["low", "mid", "high"]
cont_table = pd.crosstab(science_data.RDExpen_cat, science_data.STJpapers_cat).reindex(
    index=levels, columns=levels, fill_value=0
)

print("Contingency table:")
display(cont_table)
print("")

# and use its relative frequencies (each cell divided by the grand total) for a bar plot
print("Bar plot:")
freq = cont_table / cont_table.to_numpy().sum()
freq.plot(kind="bar")
plt.ylabel("Relative frequency")
plt.show()


We can say that countries that spend low amounts on R&D produce a low or mid number of papers, never high.<br>
Countries that spend mid amounts on R&D produce a mid or high number of papers.<br>
And finally, countries that spend high amounts on R&D also produce a mid or high number of papers.

One could conclude that it pays to increase the R&D budget enough to reach the mid level, since the number of papers is then much higher.
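
Statements of the form "countries in level X of R&D produce level Y of papers" are statements about row-wise relative frequencies. A minimal sketch of how such a table could be built, assuming `pd.crosstab` with `normalize="index"` and a hypothetical toy table (the columns `rd` and `papers` are invented):

```python
import pandas as pd

# Hypothetical categorical data, not the exam data set
toy = pd.DataFrame(
    {
        "rd": ["low", "low", "mid", "mid", "high", "high"],
        "papers": ["low", "mid", "mid", "high", "mid", "high"],
    }
)
order = ["low", "mid", "high"]

# normalize="index" divides each row by its own total,
# so every row shows the distribution of papers within one R&D level
row_freq = pd.crosstab(toy["rd"], toy["papers"], normalize="index").reindex(
    index=order, columns=order
)
print(row_freq)
```

Each row of `row_freq` sums to 1, which makes the levels comparable even when they contain different numbers of countries.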

## 3.- <font color = "Red"> Probability </font>



### 3.1.- <font color = "Blue"> Dependency (2 Points) </font> 

**For this question you need to split the `RDExpen` and `STJpapers` in three levels denoted by `low`, `mid` and `high` (use the own range of the variables). Then, are the events of `high` level in both variables dependent or independent? Why?**



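Two events are independent when the probability of the joint event equals the product of the marginal probabilities: \begin{equation}
P(STJpapers_{high}\cap RDExpen_{high}) = P(STJpapers_{high})\cdot P(RDExpen_{high})
\end{equation} If the two sides differ, the events are dependent. A joint probability different from 0 only shows that the events are not mutually exclusive.

In order to get that information, we can reuse the contingency table calculated in 2.4 above, but normalized.


In [None]:
# normalized contingency table: rows are RDExpen_cat, columns are STJpapers_cat
joint = pd.crosstab(
    science_data.RDExpen_cat, science_data.STJpapers_cat, normalize=True
)
p_joint = joint.loc["high", "high"]
p_rd_high = joint.loc["high"].sum()
p_stj_high = joint["high"].sum()
print("Joint:", p_joint, "Product of marginals:", p_rd_high * p_stj_high)
not np.isclose(p_joint, p_rd_high * p_stj_high)

If the cell returns `True`, the joint probability differs from the product of the marginals, so the `high` events in both variables are dependent.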
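
As a complementary check, independence of two events is usually assessed by comparing the joint probability with the product of the marginal probabilities. The sketch below does this on hypothetical boolean indicators `a` and `b` for eight invented units; it is not the exam data.

```python
import numpy as np
import pandas as pd

# Hypothetical "is high" indicators for 8 statistical units
a = pd.Series([True, True, False, False, True, False, False, False])
b = pd.Series([True, False, False, False, True, False, True, False])

p_a = a.mean()          # P(A): share of units where A holds
p_b = b.mean()          # P(B)
p_ab = (a & b).mean()   # P(A and B)

# A and B are independent only if P(A and B) equals P(A) * P(B)
independent = bool(np.isclose(p_ab, p_a * p_b))

print(f"P(A and B) = {p_ab:.3f}, P(A) * P(B) = {p_a * p_b:.3f}")
print("independent:", independent)
```

In this toy example the joint probability is positive and also larger than the product of the marginals, so the events are dependent. With a small sample, a small difference between the two values may be due to chance alone.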

### 3.2.- <font color = "Blue"> Conditional Probability (2 Points) </font>

**Find the probability that for any randomly chosen country, if the `RDExpen` is not high, the `STJpapers` is high**

To calculate the required probability we use the definition of conditional probability:
\begin{equation}
P(STJpapers_{high}|RDExpen_{not\_high})=\frac{P(STJpapers_{high}\cap RDExpen_{not\_high})}{P(RDExpen_{not\_high})}
\end{equation}

And we can get both numerator and denominator from the bidimensional distribution (the normalized contingency table, once we simplify both axes to high and not high):

In [None]:
# here we create two new categorical variables with high or not high values, using the limits calculated before
science_data["RDExpen_cat2"] = np.where(
    np.isnan(science_data.RDExpen),
    None,
    (
        np.where(
            science_data.RDExpen <= science_data.RDExpen.quantile(0.75),
            "not_high",
            "high",
        )
    ),
)
science_data["STJpapers_cat2"] = np.where(
    science_data.STJpapers <= science_data.STJpapers.quantile(0.75), "not_high", "high"
)

# and use these two new categorical variables to compute the bidimensional distribution
bid_dist = (
    pd.crosstab(science_data.RDExpen_cat2, science_data.STJpapers_cat2, normalize=True)
    .reindex(["not_high", "high"])
    .reindex(["not_high", "high"], axis=1)
)
display(bid_dist)

In [None]:
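# then we obtain the two values we are looking for. In bid_dist the rows are RDExpen_cat2 and
# the columns are STJpapers_cat2, so we index it with .loc[row, column].
# First, the joint probability of STJpapers being high and RDExpen being not_high
P_STJpapersh_and_RDExpenNH = bid_dist.loc["not_high", "high"]
# and then the probability of RDExpen being not_high (the sum over its row)
P_RDExpenNH = bid_dist.loc["not_high"].sum()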

# we compute it and obtain the conditional probability we were looking for
P_STJpapersH_cond_RDExpenNH = P_STJpapersh_and_RDExpenNH / P_RDExpenNH
print(
    "The conditional probability of STJpapers being high if RDExpen is not_high is",
    "{:.2%}".format(P_STJpapersH_cond_RDExpenNH),
)
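
When reading values from a crosstab it is easy to mix up rows and columns, so a direct count is a useful cross-check. This is a minimal sketch on a hypothetical toy table (the columns `rd` and `papers` are invented): the conditional probability from the normalized table should equal the share of `high` papers among the rows where `rd` is `not_high`.

```python
import pandas as pd

# Hypothetical categorical data, not the exam data set
toy = pd.DataFrame(
    {
        "rd": ["not_high", "not_high", "not_high", "high", "high", "not_high"],
        "papers": ["high", "not_high", "not_high", "high", "not_high", "not_high"],
    }
)

# joint distribution: rows are rd, columns are papers
joint = pd.crosstab(toy["rd"], toy["papers"], normalize=True)

# P(papers high and rd not_high) / P(rd not_high)
p_joint = joint.loc["not_high", "high"]
p_rd_not_high = joint.loc["not_high"].sum()
p_cond = p_joint / p_rd_not_high

# direct check: restrict to rd == not_high and take the share of high papers
direct = (toy.loc[toy["rd"] == "not_high", "papers"] == "high").mean()

print(f"from table: {p_cond:.2%}, direct: {direct:.2%}")
```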

### 3.3.- <font color = "Blue"> Expected Value (2 Points) </font> 

**Assuming that in the world, the research and development expenditure follows a normal distribution with mean 1.5 and standard deviation of 1.1, then find the expected number of countries in the sample that should have an expenditure above 3. Is that number in agreement with what you find in the data? Can you explain?**

In [None]:
# to better understand the problem, let's create a plot of the distribution
limit = 3
mu = 1.5
s = 1.1
minX = mu - 3.5 * s
maxX = mu + 3.5 * s

# we evaluate the density on a grid of 195 points (the grid size only controls how smooth the curve looks)
x = np.linspace(minX, maxX, 195)
px1 = x[x > limit]
y = norm.pdf(x=x, loc=mu, scale=s)

plt.figure(figsize=(8, 5))
ax1 = plt.subplot()
ax1.plot(x, y)
ax1.fill_between(x=px1, y1=0, y2=norm.pdf(px1, mu, s))
ax1.vlines(x=limit, ymin=0, ymax=norm.pdf(limit, mu, s), linestyle="dotted")
# in order to obtain the probability of the rightmost region,
# we compute the probability to the left of the limit (the cdf) and subtract it from 1
ax1.set_title(
    "p(x >" + str(limit) + ") =" + str(np.round(1 - norm.cdf(limit, mu, s), 3))
)

plt.show()

In [None]:
print(
    f"We have obtained a probability of {np.round(1 - norm.cdf(limit, mu, s), 3)}. Applying it to our sample of",
    science_data.shape[0],
    "countries we obtain \nthat",
    "{:.0f}".format(science_data.shape[0] * np.round(1 - norm.cdf(limit, mu, s), 3)),
    "countries in our sample should be above 3 in RDExpen",
)

Let's check it out.

In [None]:
science_data.loc[science_data.RDExpen > 3, "RDExpen"].shape[0]

YEAH!!! The number of countries found in the data agrees with the expected number. :-)
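
The upper-tail probability can also be obtained directly with the survival function `norm.sf`, which equals `1 - norm.cdf`. A minimal sketch, using the mean and standard deviation given in the question and a hypothetical sample size `n_countries` that is an assumption, not the size of the exam sample:

```python
from scipy.stats import norm

mu, sigma, limit = 1.5, 1.1, 3.0
n_countries = 100  # hypothetical sample size, for illustration only

# survival function: P(X > limit) for X ~ Normal(mu, sigma)
p_above = norm.sf(limit, loc=mu, scale=sigma)

# expected number of countries above the limit in a sample of that size
expected = n_countries * p_above

print(f"P(X > {limit}) = {p_above:.4f}")
print(f"Expected count among {n_countries} countries: {expected:.1f}")
```

The expected count is generally not an integer, and the observed count fluctuates around it from sample to sample, so exact agreement is not required for the model to be reasonable.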