# Workshop 1

Here you'll test your __junior data scientist__ skills. Read each problem carefully, write your own test cases, and validate that everything works as expected.

## 1. Regular Expressions

Complete the code below according to each __requirement__. The part marked `#YOUR CODE HERE` is where you _should_ add code to accomplish the task. However, you _could_ change anything you want.

### Problem 1.1

Find a list of all of the names in the following string using _regex_.

In [6]:
import re
from typing import List

def names() -> List[str]:
    """
    Extracts capitalized names from a given string.

    This function searches for and returns all substrings in the input 
    string that start with a capital letter followed by lowercase letters.

    Returns:
        List[str]: A list of names found in the string.
    """
    simple_string = """Amy is 5 years old, and her sister Mary is 2 years old. 
    Ruth and Peter, their parents, have 3 kids."""
    
    pattern = '[A-Z][a-z]*'
    return re.findall(pattern, simple_string)

In [7]:
# example of test case
result=names()
assert len(result) == 4, "There are four names in the simple_string."
expected_names = ["Amy", "Mary", "Ruth", "Peter"]
assert result == expected_names, f"The result should be {expected_names}, but got {result}."
print(result)


### Problem 1.2

The _dataset file_ in [assets/grades.txt](./assets/grades.txt) contains a line-separated _list of people_ with their __grade__ in a class. Create a _regex_ to generate a list of just those students who received a __B__ in the course.

In [8]:
import re
from typing import List

def grades() -> List[str]:
    """
    Extracts names of students who received a grade of 'B' from a grades file.

    This function reads a text file containing grades and returns the names of 
    students who received a grade of 'B'. Each name is expected to be in the format 
    'First Last'.

    Returns:
        List[str]: A list of names of students with a grade of 'B'.
    """
    with open("assets/grades.txt", "r") as file:
        grades = file.read()
    
    pattern = '([A-Z][a-z]+ [A-Z][a-z]+): B'
    grades_list = re.findall(pattern, grades)
    
    return grades_list


In [25]:
# example of test case
print(grades())
assert len(grades()) == 16


['Bell Kassulke', 'Simon Loidl', 'Elias Jovanovic', 'Hakim Botros', 'Emilie Lorentsen', 'Jake Wood', 'Fatemeh Akhtar', 'Kim Weston', 'Yasmin Dar', 'Viswamitra Upandhye', 'Killian Kaufman', 'Elwood Page', 'Elodie Booker', 'Adnan Chen', 'Hank Spinka', 'Hannah Bayer']
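
The capture group is what keeps the `: B` suffix out of the returned names. As a minimal sketch, the same pattern can be tried on an in-memory string; the names and grades below are made up for illustration and are not taken from `assets/grades.txt`.

```python
import re

# Hypothetical toy data in the same "First Last: GRADE" layout.
toy_grades = (
    "Ada Lovelace: A\n"
    "Alan Turing: B\n"
    "Grace Hopper: B\n"
    "Linus Torvalds: C\n"
)

# Only the parenthesized part (the name) is returned by findall.
pattern = r"([A-Z][a-z]+ [A-Z][a-z]+): B"
b_students = re.findall(pattern, toy_grades)
print(b_students)
```

Note that `[A-Z][a-z]+` only matches names made of one capital followed by lowercase ASCII letters, so a surname such as `O'Brien` would be skipped by this pattern.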


### Problem 1.3

Consider the standard _web log file_ in [assets/logdata.txt](./assets/logdata.txt). This _file_ records the _access_ a user makes when visiting a web page. Each __line of the log__ has the following _items_:

- a __host__ (e.g., `146.204.224.152`)
- a __user_name__ (e.g., `feest6811`. _Hint:_ sometimes the user name is missing! In this case, use `-` as the value for the username.)
- the __time__ a request was made (e.g., `21/Jun/2019:15:45:24 -0700`)
- the post __request type__ (e.g., `POST /incentivize HTTP/1.1`. _Note:_ not everything is a POST!)

Your task is to convert this into a list of dictionaries, where each dictionary looks like the following:

```python
example_dict = {"host":"146.204.224.152", 
                "user_name":"feest6811", 
                "time":"21/Jun/2019:15:45:24 -0700",
                "request":"POST /incentivize HTTP/1.1"}
```

In [12]:
import re
from typing import List, Dict

def logs() -> List[Dict[str, str]]:
    """
    Extracts structured log data from a log file.

    This function reads a log file and uses a regular expression to extract 
    specific fields from each log entry, returning them as a list of dictionaries.

    Returns:
        List[Dict[str, str]]: A list of dictionaries, each containing extracted log data.
    """
    with open("assets/logdata.txt", "r") as file:
        logdata = file.read()
    
    pattern = r"""
    (?P<host>[\d\.]+)         # Host IP
    (\ -\ )                   # Separator 1
    (?P<user_name>\S+)        # User name has no spaces; "-" when missing
    (\s\[)                    # Space and opening bracket before the time
    (?P<time>[^\]]+)          # Time of the request
    (\]\s\")                  # Closing bracket, space, and opening quote
    (?P<request>.*?)          # Request line
    ("\s)                     # Closing quote and space before the status code
    """

    result = []
    for item in re.finditer(pattern, logdata, re.VERBOSE):
        result.append(item.groupdict())
    
    return result


In [13]:
# example of test case
one_item = {
    "host": "146.204.224.152",
    "user_name": "feest6811",
    "time": "21/Jun/2019:15:45:24 -0700",
    "request": "POST /incentivize HTTP/1.1",
}
assert (
    one_item in logs()
), "Sorry, this item should be in the log results, check your formatting"
print(logs())
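
The test above depends on `assets/logdata.txt`. As a minimal sketch, a similar pattern can also be checked on in-memory lines, including the missing-user case. The two sample lines are illustrative assumptions (the trailing status code and byte count in particular are made up), and `LOG_PATTERN` and `parse_lines` are hypothetical names, not part of the assignment.

```python
import re
from typing import Dict, List

# A compact variant of the pattern used in logs(), written as a raw string.
LOG_PATTERN = re.compile(
    r"""
    (?P<host>[\d.]+)        # host IP
    \s-\s                   # separator
    (?P<user_name>\S+)      # user name, or a single dash when missing
    \s\[
    (?P<time>[^\]]+)        # timestamp between square brackets
    \]\s"
    (?P<request>[^"]*)      # request line between double quotes
    "
    """,
    re.VERBOSE,
)

# Assumed sample lines; the second one has no user name.
sample = (
    '146.204.224.152 - feest6811 [21/Jun/2019:15:45:24 -0700] '
    '"POST /incentivize HTTP/1.1" 302 4622\n'
    '197.109.77.178 - - [21/Jun/2019:15:45:25 -0700] '
    '"DELETE /virtual/solutions HTTP/2.0" 203 26554\n'
)


def parse_lines(text: str) -> List[Dict[str, str]]:
    """Return one dictionary of named groups per matched log line."""
    return [m.groupdict() for m in LOG_PATTERN.finditer(text)]


parsed = parse_lines(sample)
for entry in parsed:
    print(entry)
```

Testing against a small string like this makes it easier to see which part of the pattern fails when a line is skipped.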

## 2. Descriptive Analysis

For this section, you'll be looking at _2017 data on immunizations_ from the _CDC_. Your _datafile_ for the next tasks is in [assets/NISPUF17.csv](./assets/NISPUF17.csv). A _data user's guide_, which you'll need to map the variables in the data to the questions being asked, is available at [assets/NIS-PUF17-DUG.pdf](./assets/NIS-PUF17-DUG.pdf).

### Problem 2.1

Write a function called _proportion\_of\_education_ which returns the proportion of __children__ in the dataset whose mother's education level was less than high school ($<12$), high school ($12$), more than high school but not a college graduate ($>12$), or a _college degree_.

This _function_ should return a __dictionary__ in the form of (use the _correct numbers_, do not round numbers):

```python
{
    "less than high school": 0.2,
    "high school": 0.4,
    "more than high school but not college": 0.2,
    "college": 0.2
}
```

In [15]:

import pandas as pd
import numpy as np
from typing import Dict

def proportion_of_education() -> Dict[str, float]:
    """
    Calculates the proportion of different education levels in a dataset.

    This function reads a dataset from a CSV file, calculates the proportion of 
    individuals with different levels of education, and returns these proportions 
    as a dictionary.

    Returns:
        Dict[str, float]: A dictionary with education levels as keys and their 
                          respective proportions as values.
    """
    df = pd.read_csv("assets/NISPUF17.csv", index_col=0)
    
    # Extract the education column
    edus = df['EDUC1'].values
    
    # Dictionary to store the proportions of each education level
    poe = {
        "less than high school": 0.0,
        "high school": 0.0,
        "more than high school but not college": 0.0,
        "college": 0.0
    }
    
    n = len(edus)
    
    # Calculate the proportion of each education level
    poe["less than high school"] = np.sum(edus == 1) / n
    poe["high school"] = np.sum(edus == 2) / n
    poe["more than high school but not college"] = np.sum(edus == 3) / n
    poe["college"] = np.sum(edus == 4) / n
    
    return poe


In [21]:
# example of test cases
result=proportion_of_education()
assert isinstance(result, dict), "You must return a dictionary."
assert (
    len(result) == 4
), "You have not returned a dictionary with four items in it."
expected_keys=['less than high school', 'high school', 'more than high school but not college', 'college']
assert (list(result.keys())==expected_keys),"you are not using the expected keys in your dictionary"
print(result)
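
One more sanity check worth considering: if every child falls into exactly one of the four categories, the proportions should sum to 1. The sketch below shows the idea on made-up education codes, assuming the 1 to 4 coding used in `proportion_of_education`. `LABELS` and `toy_proportions` are hypothetical names introduced for illustration.

```python
from collections import Counter
from math import isclose
from typing import Dict, Sequence

# Assumed mapping from education code to dictionary key.
LABELS = {
    1: "less than high school",
    2: "high school",
    3: "more than high school but not college",
    4: "college",
}


def toy_proportions(codes: Sequence[int]) -> Dict[str, float]:
    """Share of each education code among all observations."""
    counts = Counter(codes)
    n = len(codes)
    return {label: counts[code] / n for code, label in LABELS.items()}


# Made-up codes, not NIS data.
props = toy_proportions([1, 2, 2, 3, 4, 4, 4, 2, 1, 3])
assert isclose(sum(props.values()), 1.0), "Proportions should sum to 1."
print(props)
```

Comparing with `isclose` rather than `==` avoids false failures from floating-point rounding.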


### Problem 2.2

Let's explore the relationship between being _fed breastmilk_ as a child and getting a seasonal _influenza vaccine_ from a healthcare provider. Return a __tuple__ of the _average number of influenza vaccines_ for those children we know received breastmilk as a child and those we know did not.

This _function_ should return a __tuple__ in the form (use the _correct numbers_):

```python
(2.5, 0.1)
```

In [23]:
import pandas as pd
from typing import Tuple

def average_influenza_doses() -> Tuple[float, float]:
    """
    Calculates the average number of influenza doses for breastfed and non-breastfed children.

    This function reads a dataset from a CSV file, computes the average number of 
    influenza doses for children who were breastfed and those who were not, and returns 
    these averages as a tuple.

    Returns:
        Tuple[float, float]: A tuple containing the average number of influenza doses 
                             for breastfed children and non-breastfed children.
    """
    df = pd.read_csv('assets/NISPUF17.csv', index_col=0)
    
    # Filter data for breastfed children
    BF_Flu = df[df['CBF_01'] == 1]
    avg_BF = BF_Flu['P_NUMFLU'].mean()
    
    # Filter data for non-breastfed children
    NBF_Flu = df[df['CBF_01'] == 2]
    avg_NBF = NBF_Flu['P_NUMFLU'].mean()
    
    # Create a tuple with the averages
    result = (avg_BF, avg_NBF)
    return result



In [24]:
# example of test cases
result=average_influenza_doses()
assert (
    len(result) == 2
), "Return two values in a tuple, the first for yes and the second for no."
assert result[0] >= 0, "The average number of doses for breastfed children should be non-negative."
assert result[1] >= 0, "The average number of doses for non-breastfed children should be non-negative."
print(result)


### Problem 2.3

It would be interesting to see if there is any evidence of a link between _vaccine effectiveness_ and _sex of the child_. Calculate the _ratio of the number of children_ who contracted __chickenpox__ but _were vaccinated against it_ (at least one varicella dose) versus those who were vaccinated but did not contract _chicken pox_. Return results by _sex_.

This _function_ should return a __dictionary__ in the form of (use the _correct numbers_):

```python
{
    "male":0.2,
    "female":0.4
}
```

_Note:_ To aid in verification, the `chickenpox_by_sex()['female']` value the autograder is looking for starts with the digits `0.0077`.
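
The requested quantity is a ratio of cases to non-cases among vaccinated children, not the share of cases among all vaccinated children. A minimal sketch on made-up records may help make the arithmetic concrete; it assumes the coding `SEX` 1 = male, 2 = female and `HAD_CPOX` 1 = yes, 2 = no, and `toy_ratio` is a hypothetical helper.

```python
from typing import Dict, List, Tuple

# Made-up records: (sex, had_cpox, varicella_doses). Not NIS data.
records: List[Tuple[int, int, int]] = [
    (1, 1, 1), (1, 2, 1), (1, 2, 2), (1, 2, 0),
    (2, 1, 2), (2, 2, 1), (2, 2, 1), (2, 2, 2), (2, 2, 1),
]


def toy_ratio(rows: List[Tuple[int, int, int]], sex_code: int) -> float:
    """Vaccinated children who had chickenpox divided by those who did not."""
    # Keep only children of the given sex with at least one dose.
    vaccinated = [r for r in rows if r[0] == sex_code and r[2] > 0]
    had = sum(1 for r in vaccinated if r[1] == 1)
    had_not = sum(1 for r in vaccinated if r[1] == 2)
    return had / had_not


ratios: Dict[str, float] = {
    "male": toy_ratio(records, 1),
    "female": toy_ratio(records, 2),
}
print(ratios)
```

In this toy sample the unvaccinated male record is excluded, leaving one case and two non-cases for males (ratio 0.5) and one case and four non-cases for females (ratio 0.25).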


In [25]:
import pandas as pd
from typing import Dict

def chickenpox_by_sex() -> Dict[str, float]:
    """
    Calculates the ratio of children who had chickenpox to those who didn't, 
    separated by sex, among those who received at least one dose of varicella vaccine.

    This function reads a dataset from a CSV file, filters the data for children 
    who received at least one dose of the varicella vaccine, and computes the ratio 
    of children who had chickenpox to those who did not for both males and females.

    Returns:
        Dict[str, float]: A dictionary with the ratios for males and females.
    """
    df = pd.read_csv('assets/NISPUF17.csv')
    
    # Filter data for children who received at least one dose of varicella vaccine
    c_vaccinated = df[df['P_NUMVRC'] > 0]
    
    # Calculate ratio for males
    men_stats = c_vaccinated[c_vaccinated['SEX'] == 1]
    m_no_cpox = len(men_stats[men_stats['HAD_CPOX'] == 2])
    men_ratio = len(men_stats[men_stats['HAD_CPOX'] == 1]) / m_no_cpox
    
    # Calculate ratio for females
    women_stats = c_vaccinated[c_vaccinated['SEX'] == 2]
    w_no_cpox = len(women_stats[women_stats['HAD_CPOX'] == 2])
    women_ratio = len(women_stats[women_stats['HAD_CPOX'] == 1]) / w_no_cpox
    
    # Store the ratios in a dictionary
    ratios = {'male': men_ratio, 'female': women_ratio}
    
    return ratios


In [26]:
result=chickenpox_by_sex()
assert result['male'] >= 0, "The ratio for 'male' should be non-negative."
assert result['female'] >= 0, "The ratio for 'female' should be non-negative."
print(result)

### Problem 2.4

A __correlation__ is a _statistical relationship_ between two variables. If we wanted to know _if vaccines work_, we might look at the correlation between the use of the vaccine and whether it results in prevention of the infection or disease. In this task, you are to see if there is a correlation between _having had the chicken pox_ and the _number of chickenpox vaccine doses given_ (varicella).

Some notes on interpreting the answer. The `had_chickenpox_column` is either $1$ (for _yes_) or $2$ (for _no_), and the `num_chickenpox_vaccine_column` is the number of doses a child has been given of the varicella vaccine. A _positive correlation_ (e.g., $corr > 0$) means that an increase in _had\_chickenpox\_column_ (which means more _no_’s) would also increase the values of _num\_chickenpox\_vaccine\_column_ (which means _more doses of vaccine_). If there is a _negative correlation_ (e.g., $corr < 0$), it indicates that having had chickenpox is related to an increase in the number of vaccine doses.

Also, $pval$ is the probability of observing, by chance alone, a correlation between _had\_chickenpox\_column_ and _num\_chickenpox\_vaccine\_column_ at least as strong as the one measured. A _small pval_ means that the observed correlation is highly unlikely to occur by chance. In this case, _pval_ should be very small (will end in $e-18$ indicating a very small number).
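
The sign convention can be checked on a toy sample. The sketch below computes Pearson's coefficient by hand on made-up values (not NIS data). `pearson` is a hypothetical helper that computes the same coefficient that `scipy.stats.pearsonr` returns as its first value.

```python
from math import sqrt
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Made-up toy data: 1 = had chickenpox, 2 = did not.
had_cpox = [1, 1, 2, 2, 2, 2]
doses = [0, 1, 1, 2, 2, 1]

r = pearson(had_cpox, doses)
# Positive here: the "no" group (coded 2) received more doses on average.
print(r > 0)

# Swapping the coding (1 <-> 2) flips the sign of the coefficient.
r_flipped = pearson([3 - v for v in had_cpox], doses)
print(r_flipped < 0)
```

Because "no" is coded as the larger number, a positive coefficient in this setup corresponds to more doses among children who did not have chickenpox.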

In [27]:
import scipy.stats as stats
import pandas as pd
def corr_chickenpox() -> float:
    """
    Calculates the Pearson correlation coefficient between having had chickenpox and 
    the number of varicella vaccine doses received.

    This function reads a dataset from a CSV file, filters the data to include only 
    valid entries for having had chickenpox (HAD_CPOX) and the number of varicella 
    vaccine doses (P_NUMVRC), and calculates the Pearson correlation coefficient 
    between these two variables.

    Returns:
        float: The Pearson correlation coefficient.
    """
    df = pd.read_csv('assets/NISPUF17.csv')
    dfi = df[["HAD_CPOX", "P_NUMVRC"]].copy().dropna()
    dfnew = dfi[dfi["HAD_CPOX"] <= 2]
    corr, pval = stats.pearsonr(dfnew["HAD_CPOX"], dfnew["P_NUMVRC"])
    return corr

In [28]:
# example of test cases
assert (
    -1 <= corr_chickenpox() <= 1
), "You must return a float number between -1.0 and 1.0."
print(corr_chickenpox())

0.07044873460147985
