In [1]:
import numpy as np
import pandas as pd

#### 1. Download the file covid-scandinavia.csv.bz2 from canvas and load it into a pandas dataframe.

In [2]:
covid = pd.read_csv('covid-scandinavia.csv.bz2', sep="\t")
print(covid)

     code2  country  state        date       type    count    lockdown  \
0       DK  Denmark    NaN  2020-01-22  Confirmed        0  2020-03-11   
1       DK  Denmark    NaN  2020-01-22     Deaths        0  2020-03-11   
2       DK  Denmark    NaN  2020-01-23  Confirmed        0  2020-03-11   
3       DK  Denmark    NaN  2020-01-23     Deaths        0  2020-03-11   
4       DK  Denmark    NaN  2020-01-24  Confirmed        0  2020-03-11   
...    ...      ...    ...         ...        ...      ...         ...   
6523    SE   Sweden    NaN  2022-04-14     Deaths    18605         NaN   
6524    SE   Sweden    NaN  2022-04-15  Confirmed  2495996         NaN   
6525    SE   Sweden    NaN  2022-04-15     Deaths    18605         NaN   
6526    SE   Sweden    NaN  2022-04-16  Confirmed  2495996         NaN   
6527    SE   Sweden    NaN  2022-04-16     Deaths    18605         NaN   

      population   countPC  growth  growthPC  
0        5837213  0.000000     NaN       NaN  
1        5837213 

#### 2. It’s time to get to know your data! Report the number of rows and columns in the dataset.

In [3]:
print(covid.shape)

(6528, 11)


#### 3. What variables does this dataset have? Report the variable names along with the data type of each variable.

In [4]:
covid.dtypes

code2          object
country        object
state         float64
date           object
type           object
count           int64
lockdown       object
population      int64
countPC       float64
growth        float64
growthPC      float64
dtype: object
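
The all-NaN state column shows up as float64, and the dates are stored as plain strings (object). A minimal sketch on a made-up tab-separated sample (the sample text and the name toy are hypothetical, not taken from the file) shows both effects and one way to convert the dates:

```python
import io
import pandas as pd

# Made-up two-row sample; the state field is empty on every row.
sample = (
    "country\tstate\tdate\tcount\n"
    "Denmark\t\t2020-01-22\t0\n"
    "Denmark\t\t2020-01-23\t0\n"
)
toy = pd.read_csv(io.StringIO(sample), sep="\t")

# A column with only missing values is read as float64 (NaN is a float).
print(toy.dtypes)

# Dates are read as strings unless converted explicitly.
toy["date"] = pd.to_datetime(toy["date"])
print(toy["date"].dtype)
```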

#### 4. A number of these variables we do not need. Create a new sub-dataframe that contains all observations but only variables country, date, type, count, and population.

In [5]:
# Avoid the name "vars", which would shadow Python's built-in vars() function
keepCols = ["country", "date", "type", "count", "population"]
covidCondensed = covid[keepCols].copy()
covidCondensed

      country        date       type    count  population
0     Denmark  2020-01-22  Confirmed        0     5837213
1     Denmark  2020-01-22     Deaths        0     5837213
2     Denmark  2020-01-23  Confirmed        0     5837213
3     Denmark  2020-01-23     Deaths        0     5837213
4     Denmark  2020-01-24  Confirmed        0     5837213
...       ...         ...        ...      ...         ...
6523   Sweden  2022-04-14     Deaths    18605    10377781
6524   Sweden  2022-04-15  Confirmed  2495996    10377781
6525   Sweden  2022-04-15     Deaths    18605    10377781
6526   Sweden  2022-04-16  Confirmed  2495996    10377781


#### 1. Filter the data frame you created above and keep only the confirmed cases.
How many cases do you have in the subset?

In [6]:
confirmed = covidCondensed[covidCondensed.type == "Confirmed"].copy()
print(confirmed)

      country        date       type    count  population
0     Denmark  2020-01-22  Confirmed        0     5837213
2     Denmark  2020-01-23  Confirmed        0     5837213
4     Denmark  2020-01-24  Confirmed        0     5837213
6     Denmark  2020-01-25  Confirmed        0     5837213
8     Denmark  2020-01-26  Confirmed        0     5837213
...       ...         ...        ...      ...         ...
6518   Sweden  2022-04-12  Confirmed  2491980    10377781
6520   Sweden  2022-04-13  Confirmed  2491980    10377781
6522   Sweden  2022-04-14  Confirmed  2495996    10377781
6524   Sweden  2022-04-15  Confirmed  2495996    10377781
6526   Sweden  2022-04-16  Confirmed  2495996    10377781

[3264 rows x 5 columns]
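
The number of rows in the subset can also be read off directly instead of from the printed footer. A minimal sketch on made-up data (the frame toy and its values are hypothetical):

```python
import pandas as pd

# Made-up stand-in for covidCondensed.
toy = pd.DataFrame({
    "country": ["Denmark", "Denmark", "Sweden", "Sweden"],
    "type": ["Confirmed", "Deaths", "Confirmed", "Deaths"],
    "count": [5, 0, 7, 1],
})

# Boolean mask: True for the rows we want to keep.
mask = toy["type"] == "Confirmed"
confirmedToy = toy[mask].copy()

# Two equivalent ways to count the rows in the subset.
print(len(confirmedToy))
print(int(mask.sum()))
```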


#### 2. Create a new variable, confirmed cases per capita, by dividing the confirmed cases’ count by population.

In [7]:
confirmed["confirmed cases per capita"] = confirmed["count"]/confirmed["population"]
print(confirmed)

      country        date       type    count  population  \
0     Denmark  2020-01-22  Confirmed        0     5837213   
2     Denmark  2020-01-23  Confirmed        0     5837213   
4     Denmark  2020-01-24  Confirmed        0     5837213   
6     Denmark  2020-01-25  Confirmed        0     5837213   
8     Denmark  2020-01-26  Confirmed        0     5837213   
...       ...         ...        ...      ...         ...   
6518   Sweden  2022-04-12  Confirmed  2491980    10377781   
6520   Sweden  2022-04-13  Confirmed  2491980    10377781   
6522   Sweden  2022-04-14  Confirmed  2495996    10377781   
6524   Sweden  2022-04-15  Confirmed  2495996    10377781   
6526   Sweden  2022-04-16  Confirmed  2495996    10377781   

      confirmed cases per capita  
0                       0.000000  
2                       0.000000  
4                       0.000000  
6                       0.000000  
8                       0.000000  
...                          ...  
6518                  

#### 3. Try to use the variable “count” with dot notation, as in data.count/data.population. You may get an assertion error. Can you explain why that happens? And how can you get around the problem?
#### 4. Once you have fixed this, you may notice a warning: “A value is trying to be set on a copy of a slice...”. Why do we get this warning? How can we get rid of it? Repeat the previous steps in a way that avoids the warning.

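The error happens because "count" is also the name of a built-in DataFrame method, so data.count returns that method rather than the "count" column, and a method cannot be divided by a column. We can work around this by using bracket notation, data["count"]. The warning appears when a new column is assigned to a dataframe that was obtained by filtering another dataframe, because pandas cannot tell whether the assignment is meant to affect the original as well. Explicitly creating a copy with .copy() right after filtering avoids it.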
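
Both points can be checked on a tiny example. This is a minimal sketch with made-up values; the names data, subset, and perCapita are illustrative only:

```python
import pandas as pd

# Made-up stand-in for the real table.
data = pd.DataFrame({"count": [10, 20], "population": [100, 200]})

# Dot access finds the DataFrame.count method, not the "count" column.
print(callable(data.count))

# Bracket access always returns the column.
perCapita = data["count"] / data["population"]
print(perCapita.tolist())

# Filtering and then calling .copy() gives an independent dataframe,
# so adding a column to it is unambiguous and leaves data untouched.
subset = data[data["count"] > 10].copy()
subset["perCapita"] = subset["count"] / subset["population"]
print(subset)
```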

#### 5. Now extract the cases per capita variable you created for Sweden and Norway. You should create two new workspace variables and assign the corresponding columns to them.

In [8]:
# Select rows and column in one step with .loc; the result is a new Series
perCapitaSweden = confirmed.loc[confirmed.country == "Sweden", "confirmed cases per capita"]
perCapitaNorway = confirmed.loc[confirmed.country == "Norway", "confirmed cases per capita"]

#### 6. What is the data structure of the newly created variables?

In [9]:
print(type(perCapitaSweden))
print(type(perCapitaNorway))

<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>


Both variables are pandas Series.

#### 7. What is the index of these series? What do you think, is it a useful index? What might be a better index? Why?

In [10]:
print(perCapitaSweden.index)
print(perCapitaNorway.index)

Int64Index([4896, 4898, 4900, 4902, 4904, 4906, 4908, 4910, 4912, 4914,
            ...
            6508, 6510, 6512, 6514, 6516, 6518, 6520, 6522, 6524, 6526],
           dtype='int64', length=816)
Int64Index([3264, 3266, 3268, 3270, 3272, 3274, 3276, 3278, 3280, 3282,
            ...
            4876, 4878, 4880, 4882, 4884, 4886, 4888, 4890, 4892, 4894],
           dtype='int64', length=816)


#### 8. Replace the original index with one you think is better. Show that it worked (you may just want to print a few lines).

The index of each series is just the integer row labels inherited from the original dataframe (for example 4896, 4898, ... for Sweden), which carry no meaning on their own. A useful index tells the reader what each row contains. A better index here is the date, or alternatively the day number counted from 2020-01-22, since the data for each country starts on that date.

In [11]:
perCapitaSweden.index = confirmed[confirmed.country == "Sweden"].date
perCapitaNorway.index = confirmed[confirmed.country == "Norway"].date

print(perCapitaSweden.sample(3))
print(perCapitaNorway.sample(3))

date
2020-06-04    0.004076
2020-11-17    0.018543
2022-01-25    0.185299
Name: confirmed cases per capita, dtype: float64
date
2021-02-03    0.011856
2021-09-22    0.034285
2022-02-25    0.226073
Name: confirmed cases per capita, dtype: float64
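
The index above holds the dates as strings, which is enough for exact lookups. As a possible refinement, converting the labels to a DatetimeIndex also allows selecting whole months or date ranges. A minimal sketch on a made-up series (the name s and its values are hypothetical):

```python
import pandas as pd

# Made-up series with string dates as the index.
s = pd.Series(
    [0.1, 0.2, 0.3],
    index=["2020-01-22", "2020-01-23", "2020-02-01"],
    name="confirmed cases per capita",
)

# Convert the string labels to a DatetimeIndex.
s.index = pd.to_datetime(s.index)

# Partial-string indexing selects every row in January 2020.
january = s.loc["2020-01"]
print(january)
```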


#### 1. Now compare the number of confirmed cases per capita in Sweden and Norway at four different time points: 2020-05-01, 2020-12-01, 2021-07-01 and 2022-01-01.

In [12]:
dates = ["2020-05-01", "2020-12-01", "2021-07-01", "2022-01-01"]
print(perCapitaSweden.loc[dates])
print(perCapitaNorway.loc[dates])

date
2020-05-01    0.002133
2020-12-01    0.025127
2021-07-01    0.105085
2022-01-01    0.126692
Name: confirmed cases per capita, dtype: float64
date
2020-05-01    0.001445
2020-12-01    0.006796
2021-07-01    0.024423
2022-01-01    0.073620
Name: confirmed cases per capita, dtype: float64
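
To make the comparison easier to read, the two series can be placed side by side. The sketch below re-enters the values printed above by hand; the names sweden, norway, and comparison, and the ratio column, are illustrative additions rather than part of the assignment:

```python
import pandas as pd

# Values copied from the printed output above.
dates = ["2020-05-01", "2020-12-01", "2021-07-01", "2022-01-01"]
sweden = pd.Series([0.002133, 0.025127, 0.105085, 0.126692], index=dates)
norway = pd.Series([0.001445, 0.006796, 0.024423, 0.073620], index=dates)

# One row per date, one column per country.
comparison = pd.DataFrame({"Sweden": sweden, "Norway": norway})

# Ratio above 1 means Sweden had more confirmed cases per capita.
comparison["ratio"] = comparison["Sweden"] / comparison["Norway"]
print(comparison.round(3))
```

On all four dates the ratio is above 1, so Sweden had more confirmed cases per capita than Norway at each of these time points.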
