Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All). Do NOT add any cells to the notebook!

Make sure you fill in any place that says `YOUR CODE HERE` or _YOUR ANSWER HERE_, as well as your name and group below:

In [None]:
NAME = "Christoph Helmberger"
STUDENTID = "11915039"
GROUPID = "3";

# Assignment 5 (Group)
In Assignment 2, as a group, you trained yourselves in accessing and characterising two data sources. You also sketched out a data-science project based on these data sources. In this assignment, based on this project idea, you should select, implement, and describe 3 appropriate visualisations.

The following materials provide the necessary background:
* the slide deck on visualisations (Unit 5) and the corresponding notebook;
* Chapter 3 of "Data Science from Scratch";
* the supplemental read on "Task-Based Effectiveness of Basic Visualizations" available from MyLearn: _B. Saket, A. Endert and Ç. Demiralp (2019), "Task-Based Effectiveness of Basic Visualizations," in IEEE Transactions on Visualization and Computer Graphics, vol. 25, no. 7, pp. 2505-2512, DOI: 10.1109/TVCG.2018.2829750_

Requirements:
* The visualisation should fit the chosen tasks on the data sets.
* You should employ at least two different types of visualisations. Even if two tasks in two steps below were identical (e.g., two aggregation tasks), you would be expected to select a different visualisation for each. 
* As opposed to Assignment 2, you are expected to use pandas to represent and to prepare the data sets for visualisation.
* As for the Assignment 2 data sets, to avoid confusion:
 * Use the genuine ones, not the manipulated ones (having anomalies introduced). 
 * If you worked with excerpts (samples) from the original and genuine datasets, you may continue using these. You are also free to use the complete datasets, but this is not expected.
 * Please stick to your project description in Assignment 2 when choosing tasks and corresponding visualisations.

-----
## Step 1 (6 points)

Select, implement and describe one visualisation for data source 1 (in isolation from data source 2).

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import os.path

plt.rcParams['figure.figsize'] = (15, 9)

df = pd.read_csv('./data/data_notebook-1_dataset1.csv')
df[['year','week']] = df['year_week'].str.split('-', expand = True) #splitting column year_week into year and week

#grouping per year and age groups and calculating the new cases
dfAgg = pd.DataFrame(df.groupby(['age_group','year'])['new_cases'].sum().unstack('year').fillna(0))
dfAgg = dfAgg.reindex(np.roll(dfAgg.index, shift = 1))

#plotting a grouped bar chart and setting the title and axis labels with the respective font sizes
pl = dfAgg[['2020','2021']].plot.bar(color = ['purple', 'green'])
plt.title('New cases (in Millions) per year and age group in EU', fontsize = 'xx-large')
plt.ylabel('# of new cases (in Millions)', fontsize = 'xx-large')
plt.xlabel('Age groups', fontsize = 'xx-large')

legend = plt.legend(loc = 'upper right', fontsize = 'x-large')
legend.set_title('Year', prop = {'size':15})
plt.show()

Document your decision and describe the resulting visualisation. In your answer, cover the following aspects:

* What is the task on the data source supported by the chosen visualisation?
* Why is the chosen visualisation effective for the given task?
* What does the visualisation show exactly?
* What does the visualisation contribute to answering your project's questions?

Our goal for Step 1 was to visualise the number of new COVID-19 cases per age group and year. The chart compares the new cases across six age groups for 2020 (purple bars) and 2021 (green bars). The tasks on the data source supported by the chosen visualisation are clustering (by age group and year) and computing a derived value (by aggregation). A grouped bar chart lets us compare the new cases across age groups and years at a glance. It is easy to interpret and presents the values clearly, which makes it a good choice for this task.

The bar plot shows the number of new cases (in millions) on the y-axis for each age group on the x-axis, with one purple bar for 2020 and one green bar for 2021 per age group. The visualisation contributes to our project idea by showing the difference in new cases between two years, and the same comparison can be extended to other time frames.
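
The aggregation behind the bar plot can be sketched on a small toy table. The values below are made up for illustration and are not taken from the data source; only the column names (`year_week`, `age_group`, `new_cases`) follow the code cell above.

```python
import pandas as pd

# Toy records with invented numbers (NOT the assignment data).
toy = pd.DataFrame({
    "year_week": ["2020-50", "2020-51", "2021-01", "2021-02", "2020-50", "2021-01"],
    "age_group": ["<15yr", "<15yr", "<15yr", "<15yr", "15-24yr", "15-24yr"],
    "new_cases": [10, 20, 5, 15, 7, 3],
})

# Derive the year from the 'YYYY-WW' key.
toy["year"] = toy["year_week"].str.split("-").str[0]

# Sum the new cases per age group and year, then pivot the years into columns.
wide = (
    toy.groupby(["age_group", "year"])["new_cases"]
    .sum()
    .unstack("year")
    .fillna(0)
)
```

Calling `wide.plot.bar()` on such a table draws one group of bars per age group (row) and one bar per year (column), which is the layout of the chart described here.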

------
## Step 2 (6 points)

Select, implement and describe one visualisation for data source 2 (in isolation from data source 1).

In [None]:
plt.rcParams['figure.figsize'] = (15, 9)

json_df = pd.read_json ('./data/data_notebook-1_dataset2.json') #loading file into dataframe
json_df_new = pd.DataFrame(json_df[json_df['level'] == 'national']).reset_index() #filtering for only national level data
json_df_new = json_df_new.groupby('year_week')['new_cases'].sum().to_frame().reset_index() #calculating the new cases for the different year_weeks
pl = json_df_new.plot.line(x = 'year_week', y = 'new_cases', color = 'red', label = 'New cases', linewidth = 4) #plotting the line graph with chosen x and y

#setting the title and labels for the axis with the respective font sizes
plt.title('New cases (in Millions) per week in EU', fontsize = 'xx-large') #defining title
plt.ylabel('# of new cases (in Millions)', fontsize = 'xx-large') #defining y-axis
plt.xlabel('Year-Week', fontsize = 'xx-large') #defining x-axis


plt.legend(loc = 'upper right', fontsize = 'x-large')
plt.show()

Document your decision and describe the resulting visualisation. In your answer, cover the following aspects:

* What is the task on the data source supported by the chosen visualisation?
* Why is the chosen visualisation effective for the given task?
* What does the visualisation show exactly?
* What does the visualisation contribute to answering your project's questions?

The line chart is commonly used to show how data evolves over time, which makes it well suited to our goal. The task on the data source supported by the chosen visualisation is computing a derived value (by aggregating the number of new cases across all countries in the European Union). Our goal was to see how the total number of new cases (y-axis) in the European Union evolved over the weeks in which the data was recorded (x-axis, starting from week 1 of 2020). We can clearly see that the number of new cases per week rose steadily until it reached a peak around week 45 of 2020. Since then, the number of new cases has, in general, gone down slightly but fluctuates greatly.

This visualisation contributes to our project idea as it shows us the total number of new cases over time, which can then be used to look at the different age groups.
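
A reading such as "peak around week 45" can also be checked programmatically. The following is a minimal sketch on invented toy values; the `country` column is hypothetical and only stands in for whatever identifies a country in the data source.

```python
import pandas as pd

# Toy weekly records for two fictional countries (invented numbers).
toy = pd.DataFrame({
    "country": ["A", "B", "A", "B", "A", "B"],
    "year_week": ["2020-W44", "2020-W44", "2020-W45", "2020-W45", "2020-W46", "2020-W46"],
    "new_cases": [100, 50, 180, 90, 120, 60],
})

# Derived value: total new cases per week across all countries.
weekly = toy.groupby("year_week")["new_cases"].sum()

# The index label of the maximum is the peak week; max() is its height.
peak_week = weekly.idxmax()
peak_value = weekly.max()
```

On the real data, the same two calls on the aggregated series would name the exact week that the line chart only lets us estimate by eye.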

-----
## Step 3  (6 points)

Merge the two data sets (or, relevant subsets thereof) based on your project idea from Assignment 2. Select, implement and describe one visualisation on the combined data set. Make sure you visualize variables taken from both original data sets.

In [None]:
df_json = pd.read_json ('./data/data_notebook-1_dataset2.json')
df_csv = pd.read_csv("./data/data_notebook-1_dataset1.csv")
df_csv[['year','week']] = df_csv['year_week'].str.split('-', expand = True)

new_json = pd.DataFrame(df_json[df_json['level'] == 'national']).reset_index()
new_json = new_json.groupby('year_week')['tests_done'].sum().to_frame().reset_index()
new_json[['year','week']] = new_json['year_week'].str.split('-W', expand = True)

dfAgg2 = df_csv.groupby(['age_group','year_week'])['new_cases'].sum().to_frame().reset_index()
dfAgg2[['year','week']] = dfAgg2['year_week'].str.split('-', expand = True)

dfJoin = dfAgg2.merge(new_json, left_on = ['year', 'week'], right_on= ['year', 'week']).reset_index()
dfJoin['week'] = dfJoin['week'].astype(int)
del dfJoin['index']

temp_df = pd.DataFrame(dfJoin[dfJoin['year'] == '2021']).reset_index()

temp_df['rate_per_age_group'] = None
rate_ind = temp_df.columns.get_loc('rate_per_age_group')
new_cases_ind = temp_df.columns.get_loc('new_cases')
tests_done_ind = temp_df.columns.get_loc('tests_done')
row_count = temp_df.shape[0]

#calculating the positivity rate from the merged dataframe and storing it in a new column in the same dataframe
for row in range(0, row_count):
    rate = (temp_df.iat[row, new_cases_ind]*100)/temp_df.iat[row, tests_done_ind]
    temp_df.iat[row, rate_ind] = rate

#deleting the unnecessary columns
del temp_df['year_week_y']
del temp_df['index']

temp_df = temp_df.rename(columns={'year_week_x': 'year_week'})
plot_df = temp_df[["year_week", "age_group","rate_per_age_group"]].reset_index()

del plot_df['index']

plt.rcParams['figure.figsize'] = (15, 9)

list_unique = plot_df['age_group'].unique()

#put the '<15yr' to the front to have it all in order
list_unique = np.roll(list_unique, shift = 1)

x = plot_df['year_week']
y = plot_df['rate_per_age_group']

for i in range(len(list_unique)):
    idx = plot_df['age_group'] == list_unique[i]
    plt.plot(x[idx], y[idx], label=list_unique[i], linewidth = 3)

#list1 is a list of all unique weeks in 2021 - W1-W42
list1 = list()
for year_week in plot_df[plot_df['year_week'].index % 4 == 0]['year_week'].tolist():
    if year_week not in list1:
        list1.append(year_week)

#list2 is used for a rough estimate of starting weeks for each month in 2021 - 4 weeks = 1 month
list2 = list()
for elem in list1:
    if list1.index(elem) % 4 == 0:
        list2.append(elem)
        
plt.xlabel('Month-Year', fontsize = 'xx-large')
plt.ylabel('% of positive tests per age group', fontsize = 'xx-large')
plt.title('Positive COVID-19 test rate per age group for year 2021 (Jan-Nov)', fontsize = 'xx-large')

# xticks method is used to name the rough month starting weeks from list2 to Month-Year format
plt.xticks(ticks = list2, labels = ['Jan-2021', 'Feb-2021', 'Mar-2021', 'Apr-2021', 'May-2021', 'Jun-2021', 'Jul-2021', 'Aug-2021', 'Sep-2021', 'Oct-2021', 'Nov-2021'])
legend = plt.legend(loc = 'upper right', fontsize = 'x-large')
legend.set_title('Age groups', prop = {'size':15})
plt.show()

Document your decision and describe the resulting visualisation. In your answer, cover the following aspects:

* What is the task on the combined data set supported by the chosen visualisation?
* Why is the chosen visualisation effective for the given task?
* What does the visualisation show exactly?
* What does the visualisation contribute to answering your project's questions?

The goal of this visualisation was to show the new cases of each age group as a percentage of the total tests done, and to compare the age groups with each other over the year 2021 (from January until November). The tasks on the combined data set supported by the chosen visualisation are computing a derived value (the positivity rate per age group for 2021) and clustering (by age group). The line chart is an effective choice for this goal because it lets us clearly see these differences over time.
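
The derived value can be sketched as a column-wise computation on a toy example. The numbers are invented; the column names follow the code cell above. This is an alternative formulation for illustration, not the row-by-row loop used in our cell.

```python
import pandas as pd

# Toy case counts per age group and week (invented numbers).
cases = pd.DataFrame({
    "year": ["2021", "2021", "2021", "2021"],
    "week": ["01", "01", "02", "02"],
    "age_group": ["<15yr", "25-49yr", "<15yr", "25-49yr"],
    "new_cases": [10, 40, 5, 30],
})

# Toy total tests per week, over all age groups (invented numbers).
tests = pd.DataFrame({
    "year": ["2021", "2021"],
    "week": ["01", "02"],
    "tests_done": [1000, 500],
})

# Join on the shared keys, then compute the rate for all rows at once.
merged = cases.merge(tests, on=["year", "week"], how="inner")
merged["rate_per_age_group"] = merged["new_cases"] * 100 / merged["tests_done"]
```

Because the denominator is the total number of tests in a week, the rates of all age groups in that week share the same base and can be compared directly.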

The visualisation is a line chart that displays, for each age group, its positive tests as a percentage of the total tests done. The y-axis shows this percentage, and the x-axis shows the months from January 2021 until November 2021. The differently coloured lines represent the age groups. We can see that the age group 25-49 consistently had the highest number of new cases relative to the tests done and that, overall, the fewest tests were positive at the beginning of July 2021, after which the rates rose again. This visualisation helps us answer the question of how the cases of the different age groups, relative to total tests, changed over a period of several months.
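
The month labels on the x-axis approximate one month as four weeks. As a hedged alternative, assuming the `year_week` keys follow ISO 8601 week numbering (an assumption we have not verified against the data sources), the month of a week could be derived from that week's Monday. `week_start` is a hypothetical helper introduced only for this sketch.

```python
from datetime import date


def week_start(year_week: str) -> date:
    """Return the Monday of an ISO week given as 'YYYY-WW' or 'YYYY-Www'."""
    year, week = year_week.replace("W", "").split("-")
    # Day 1 of an ISO week is its Monday (requires Python 3.8+).
    return date.fromisocalendar(int(year), int(week), 1)


# Map a few example week keys to 'Mon-YYYY' style tick labels.
example_weeks = ["2021-01", "2021-W05", "2021-W26"]
labels = {yw: week_start(yw).strftime("%b-%Y") for yw in example_weeks}
```

With such a mapping, a tick could be placed at the first week whose label differs from the previous one, instead of at every fourth week.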

-----
## Step 4  (2 points)

Persist the merged dataset from Step 3 as a file.

In [None]:
#file of the dataset we used to plot the final visualization in Step 3
plot_df.to_csv('./data/data_notebook-1_plot_dataset.csv', index = False, sep = ',')
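
Whether a persisted file can be read back unchanged can be checked with a round trip. The sketch below uses a toy frame with invented values and a temporary directory; the file name is hypothetical, and only the column names follow `plot_df`.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the plotted data (invented values).
toy = pd.DataFrame({
    "year_week": ["2021-01", "2021-02"],
    "age_group": ["<15yr", "<15yr"],
    "rate_per_age_group": [1.5, 2.25],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_plot_dataset.csv")  # hypothetical file name
    toy.to_csv(path, index=False)
    # Read the key columns back as text so they are not reinterpreted.
    restored = pd.read_csv(path, dtype={"year_week": str, "age_group": str})

# Compare column by column; True means nothing changed on disk.
round_trip_ok = restored.to_dict("list") == toy.to_dict("list")
```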