# Let us continue - This is Part 2

#### TODO:

Import pandas as `pd` and read the CSV file again.

In [1]:
import pandas as pd

In [2]:
covid_df = pd.read_csv('italy-covid-daywise.csv')

#### TODO: From previous part

Sort the dataframe by `new_cases` using `sort_values()` and print the first 10 rows. The default sort order is ascending, so you don't need to set the `ascending` parameter.

In [3]:
covid_df.sort_values('new_cases').head(10)

Unnamed: 0,date,new_cases,new_deaths,new_tests
119,2020-06-20,-148.0,47.0,29875.0
0,2020-02-22,14.0,0.0,
2,2020-02-24,53.0,0.0,
1,2020-02-23,62.0,2.0,
5,2020-02-27,78.0,1.0,
4,2020-02-26,93.0,5.0,
3,2020-02-25,97.0,4.0,
123,2020-06-24,113.0,18.0,30237.0
144,2020-07-15,114.0,17.0,28392.0
129,2020-06-30,126.0,6.0,28471.0


It seems like the count of new cases on Jun 20, 2020, was `-148`, a negative number! Not something we might have expected, but that's the nature of real-world data. It could be a data entry error, or the government may have issued a correction to account for miscounting in the past. Can you dig through news articles online and figure out why the number was negative?

#### TODO:

Let's look at some days before and after Jun 20, 2020.

In [4]:
covid_df.loc[115:125]

Unnamed: 0,date,new_cases,new_deaths,new_tests
115,2020-06-16,301.0,26.0,27762.0
116,2020-06-17,210.0,34.0,33957.0
117,2020-06-18,328.0,43.0,32921.0
118,2020-06-19,331.0,66.0,28570.0
119,2020-06-20,-148.0,47.0,29875.0
120,2020-06-21,264.0,49.0,24581.0
121,2020-06-22,224.0,24.0,16152.0
122,2020-06-23,221.0,23.0,23225.0
123,2020-06-24,113.0,18.0,30237.0
124,2020-06-25,577.0,-31.0,29421.0


For now, let's assume this was indeed a data entry error. We can use one of the following approaches for dealing with the faulty value:

1. Replace it with `0`.
2. Replace it with the average of the entire column.
3. Replace it with the average of the values on the previous and next dates.
4. Discard the row entirely.

#### TODO:

Which approach you pick requires some context about the data and the problem. In this case, since we are dealing with data ordered by date, we can go ahead with the third approach.

You can use the `.at` accessor to read or modify a single value within the dataframe, addressed by row label and column name.

In [5]:
covid_df.at[119, 'new_cases'] = (covid_df.at[118, 'new_cases'] + covid_df.at[120, 'new_cases'])/2
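
The neighbor-average repair can be tried in isolation on a small frame. This is only a sketch: the four values are copied from rows 117 to 120 of the output above but re-indexed from 0, and `toy_df` and `bad_idx` are hypothetical names. It assumes a default integer index, so that `bad_idx - 1` and `bad_idx + 1` are the adjacent dates.

```python
import pandas as pd

# Values copied from rows 117-120 above, re-indexed from 0 for illustration.
toy_df = pd.DataFrame({
    "date": ["2020-06-18", "2020-06-19", "2020-06-20", "2020-06-21"],
    "new_cases": [328.0, 331.0, -148.0, 264.0],
})

bad_idx = 2  # row label of the faulty value (hypothetical name)

# Approach 3: average the values on the previous and next rows.
neighbor_mean = (toy_df.at[bad_idx - 1, "new_cases"]
                 + toy_df.at[bad_idx + 1, "new_cases"]) / 2

# .at writes a single cell addressed by row label and column name.
toy_df.at[bad_idx, "new_cases"] = neighbor_mean

print(toy_df)
```

The replacement value is 297.5, which is why some later totals in this notebook end in `.5`.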

## Missing data

As data comes in many shapes and forms, pandas aims to be flexible with regard to handling missing data. While `NaN` is the default missing value marker for reasons of computational speed and convenience, we need to be able to easily detect this value with data of different types: floating point, integer, boolean, and general object.

In [6]:
covid_df

Unnamed: 0,date,new_cases,new_deaths,new_tests
0,2020-02-22,14.0,0.0,
1,2020-02-23,62.0,2.0,
2,2020-02-24,53.0,0.0,
3,2020-02-25,97.0,4.0,
4,2020-02-26,93.0,5.0,
...,...,...,...,...
190,2020-08-30,1444.0,1.0,53541.0
191,2020-08-31,1365.0,4.0,42583.0
192,2020-09-01,996.0,6.0,54395.0
193,2020-09-02,975.0,8.0,


#### TODO:

Count the NaN values in each column.

In [7]:
# Give the number of NaN values in every column
covid_df.isnull().sum()

date           0
new_cases      0
new_deaths     0
new_tests     60
dtype: int64

We can use one of the following approaches for dealing with missing values:

1. Given a large dataset, if a column contains more than 15% missing values, delete the column.
2. Given a large dataset, if only a few observations have missing values, delete those observations and keep the column.
3. Replace the missing values with an aggregate value: the mean, mode, or median, or a constant value (U).
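
The three approaches can be sketched on a small made-up frame. Everything below is illustrative: the data, the column names `a`, `b`, `c`, and the variable names are assumptions, and the 15% threshold is the rule of thumb from the list above rather than a pandas default.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only: "a" is 10% missing, "b" is 60% missing.
toy_df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    "b": [np.nan] * 6 + [1.0, 2.0, 3.0, 4.0],
    "c": [float(i) for i in range(10)],
})

# Fraction of missing values per column.
missing_share = toy_df.isnull().mean()

# Approach 1: drop columns where more than 15% of the values are missing.
cols_to_drop = missing_share[missing_share > 0.15].index
dropped_cols_df = toy_df.drop(columns=cols_to_drop)

# Approach 2: drop the few observations (rows) with a missing value in "a".
dropped_rows_df = toy_df.dropna(subset=["a"])

# Approach 3: fill the missing values in "a" with the column median.
filled_df = toy_df.fillna({"a": toy_df["a"].median()})

print(missing_share)
print(dropped_cols_df.columns.tolist())
```

Which approach fits depends on how much data is missing and on whether the gaps are random or systematic.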

#### TODO:

Drop all rows that contain NaN values with `inplace=False`. Then set `inplace=True` and see the difference.

In [8]:
# To drop any rows that have missing data.
covid_df.dropna(inplace=False)

Unnamed: 0,date,new_cases,new_deaths,new_tests
58,2020-04-20,3047.0,433.0,7841.0
59,2020-04-21,2256.0,454.0,28095.0
60,2020-04-22,2729.0,534.0,44248.0
61,2020-04-23,3370.0,437.0,37083.0
62,2020-04-24,2646.0,464.0,95273.0
...,...,...,...,...
188,2020-08-28,1409.0,5.0,65135.0
189,2020-08-29,1460.0,9.0,64294.0
190,2020-08-30,1444.0,1.0,53541.0
191,2020-08-31,1365.0,4.0,42583.0


https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html
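
The difference between the two `inplace` settings is easiest to see on a tiny frame. A minimal sketch with made-up data; `toy_df` and the other variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only.
toy_df = pd.DataFrame({"x": [1.0, np.nan, 3.0]})

# inplace=False (the default) returns a new frame and leaves toy_df untouched.
cleaned = toy_df.dropna(inplace=False)
rows_after_copy = len(toy_df)

# inplace=True modifies toy_df itself and returns None.
result = toy_df.dropna(inplace=True)
rows_after_inplace = len(toy_df)

print(rows_after_copy, rows_after_inplace, result)
```

Because `inplace=True` discards rows permanently, running it on `covid_df` would remove every day without test data from the rest of the analysis.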

## Working with dates

While we've looked at overall numbers for the cases, tests, positive rate, etc., it would also be useful to study these numbers on a month-by-month basis. The `date` column might come in handy here, as Pandas provides many utilities for working with dates.

#### TODO:

Check the columns' data types

In [9]:
covid_df.dtypes

date           object
new_cases     float64
new_deaths    float64
new_tests     float64
dtype: object

#### TODO:

Access the `date` column.

In [10]:
covid_df.date

0      2020-02-22
1      2020-02-23
2      2020-02-24
3      2020-02-25
4      2020-02-26
          ...    
190    2020-08-30
191    2020-08-31
192    2020-09-01
193    2020-09-02
194    2020-09-03
Name: date, Length: 195, dtype: object

The data type of date is currently `object`, so Pandas does not know that this column is a date. We can convert it into a `datetime` column using the `pd.to_datetime` method.
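
As a small sketch of the conversion, the example below parses a made-up series of strings. The `format` and `errors="coerce"` arguments are optional extras not used in this notebook: `format` states the expected layout explicitly, and `errors="coerce"` turns unparseable entries into `NaT` instead of raising an error.

```python
import pandas as pd

# Made-up strings for illustration; the last entry is deliberately invalid.
raw_dates = pd.Series(["2020-02-22", "2020-02-23", "not a date"])

# Parse with an explicit format; invalid entries become NaT (missing datetime).
parsed = pd.to_datetime(raw_dates, format="%Y-%m-%d", errors="coerce")

print(parsed)
print(parsed.dtype)
```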

#### TODO:

Convert the `date` column to the datetime data type

In [11]:
covid_df['date'] = pd.to_datetime(covid_df.date)

In [12]:
covid_df['date']

0     2020-02-22
1     2020-02-23
2     2020-02-24
3     2020-02-25
4     2020-02-26
         ...    
190   2020-08-30
191   2020-08-31
192   2020-09-01
193   2020-09-02
194   2020-09-03
Name: date, Length: 195, dtype: datetime64[ns]

You can see that it now has the datatype `datetime64`. We can now extract different parts of the date into separate columns, using the `DatetimeIndex` class ([view docs](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.DatetimeIndex.html)).

In [13]:
covid_df

Unnamed: 0,date,new_cases,new_deaths,new_tests
0,2020-02-22,14.0,0.0,
1,2020-02-23,62.0,2.0,
2,2020-02-24,53.0,0.0,
3,2020-02-25,97.0,4.0,
4,2020-02-26,93.0,5.0,
...,...,...,...,...
190,2020-08-30,1444.0,1.0,53541.0
191,2020-08-31,1365.0,4.0,42583.0
192,2020-09-01,996.0,6.0,54395.0
193,2020-09-02,975.0,8.0,


#### TODO:

Create 4 new columns.

1. year column
2. month column
3. day column
4. weekday column

In [14]:
covid_df['year'] = pd.DatetimeIndex(covid_df.date).year
covid_df['month'] = pd.DatetimeIndex(covid_df.date).month
covid_df['day'] = pd.DatetimeIndex(covid_df.date).day
covid_df['weekday'] = pd.DatetimeIndex(covid_df.date).weekday
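
An equivalent way to extract the same parts is the `.dt` accessor available on datetime series. This is an alternative to the `DatetimeIndex` approach used in this notebook, sketched here on a made-up three-row frame (`toy_df` is a hypothetical name).

```python
import pandas as pd

# Made-up frame for illustration, using three dates from the dataset.
toy_df = pd.DataFrame({"date": ["2020-02-22", "2020-02-23", "2020-02-24"]})
toy_df["date"] = pd.to_datetime(toy_df["date"])

# The .dt accessor exposes datetime components of each element.
toy_df["year"] = toy_df["date"].dt.year
toy_df["month"] = toy_df["date"].dt.month
toy_df["day"] = toy_df["date"].dt.day
toy_df["weekday"] = toy_df["date"].dt.weekday  # Monday=0 ... Sunday=6

print(toy_df)
```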

In [15]:
covid_df

Unnamed: 0,date,new_cases,new_deaths,new_tests,year,month,day,weekday
0,2020-02-22,14.0,0.0,,2020,2,22,5
1,2020-02-23,62.0,2.0,,2020,2,23,6
2,2020-02-24,53.0,0.0,,2020,2,24,0
3,2020-02-25,97.0,4.0,,2020,2,25,1
4,2020-02-26,93.0,5.0,,2020,2,26,2
...,...,...,...,...,...,...,...,...
190,2020-08-30,1444.0,1.0,53541.0,2020,8,30,6
191,2020-08-31,1365.0,4.0,42583.0,2020,8,31,0
192,2020-09-01,996.0,6.0,54395.0,2020,9,1,1
193,2020-09-02,975.0,8.0,,2020,9,2,2


Let's check the overall metrics for May. We can query the rows for May, choose a subset of columns, and use the `sum` method to aggregate each selected column's values.

#### TODO:

Query the rows for May

In [16]:
# Select the rows where the month is May (5)
covid_df_may = covid_df[covid_df.month == 5]
covid_df_may

Unnamed: 0,date,new_cases,new_deaths,new_tests,year,month,day,weekday
69,2020-05-01,1872.0,285.0,43732.0,2020,5,1,4
70,2020-05-02,1965.0,269.0,31231.0,2020,5,2,5
71,2020-05-03,1900.0,474.0,27047.0,2020,5,3,6
72,2020-05-04,1389.0,174.0,22999.0,2020,5,4,0
73,2020-05-05,1221.0,195.0,32211.0,2020,5,5,1
74,2020-05-06,1075.0,236.0,37771.0,2020,5,6,2
75,2020-05-07,1444.0,369.0,13665.0,2020,5,7,3
76,2020-05-08,1401.0,274.0,45428.0,2020,5,8,4
77,2020-05-09,1327.0,243.0,36091.0,2020,5,9,5
78,2020-05-10,1083.0,194.0,31384.0,2020,5,10,6


#### TODO:

Extract the subset of columns (`new_cases`, `new_deaths`, `new_tests`) for the month of May so that we can aggregate them later.

In [17]:
# Extract the subset of columns to be aggregated
covid_df_may_metrics = covid_df_may[['new_cases', 'new_deaths', 'new_tests']]
covid_df_may_metrics 

Unnamed: 0,new_cases,new_deaths,new_tests
69,1872.0,285.0,43732.0
70,1965.0,269.0,31231.0
71,1900.0,474.0,27047.0
72,1389.0,174.0,22999.0
73,1221.0,195.0,32211.0
74,1075.0,236.0,37771.0
75,1444.0,369.0,13665.0
76,1401.0,274.0,45428.0
77,1327.0,243.0,36091.0
78,1083.0,194.0,31384.0


#### TODO:

Sum the values in the above-mentioned columns. Store the results into a variable called covid_may_totals

In [18]:
# Get the column-wise sum
covid_may_totals = covid_df_may_metrics.sum()
covid_may_totals

new_cases       29073.0
new_deaths       5658.0
new_tests     1078720.0
dtype: float64

#### TODO:

Check the type of the variable covid_may_totals

In [19]:
type(covid_may_totals)

pandas.core.series.Series

#### TODO:

Combine the above operations into a single statement.

In [20]:
covid_df[covid_df.month == 5][['new_cases', 'new_deaths', 'new_tests']].sum()

new_cases       29073.0
new_deaths       5658.0
new_tests     1078720.0
dtype: float64
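
The same filter, select, and sum steps can also be written with a named boolean mask and `.loc`, which selects rows and columns in one indexing operation. A minimal sketch on made-up data; `toy_df`, `is_may`, and `may_totals` are hypothetical names.

```python
import pandas as pd

# Made-up data for illustration only.
toy_df = pd.DataFrame({
    "month": [4, 5, 5, 6],
    "new_cases": [10.0, 20.0, 30.0, 40.0],
    "new_deaths": [1.0, 2.0, 3.0, 4.0],
})

# Boolean mask: True for the rows that belong to May.
is_may = toy_df["month"] == 5

# Select the May rows and the columns of interest, then sum each column.
may_totals = toy_df.loc[is_may, ["new_cases", "new_deaths"]].sum()

print(may_totals)
```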

As another example, let's check if the number of cases reported on Sundays is higher than the average number of cases reported every day. This time, we might want to aggregate columns using the `.mean` method.

#### TODO:

Get the average of `new_cases` across all days (the overall average)

In [21]:
# Overall average
covid_df.new_cases.mean()

1394.6538461538462

#### TODO:

Get the average of new_cases on Sunday

In [22]:
# Average for Sundays
covid_df[covid_df.weekday == 6].new_cases.mean()

1559.0714285714287

It seems that, on average, more cases were reported on Sundays (about 1,559) than on a typical day (about 1,395).
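
The overall average includes the Sundays themselves, so a stricter comparison sets the Sunday mean against the mean of all other days. A minimal sketch on made-up data (the numbers and variable names are hypothetical), negating the mask with `~`:

```python
import pandas as pd

# Made-up data for illustration only; weekday 6 is Sunday.
toy_df = pd.DataFrame({
    "weekday": [5, 6, 0, 6, 1],
    "new_cases": [100.0, 200.0, 50.0, 300.0, 150.0],
})

is_sunday = toy_df["weekday"] == 6

sunday_mean = toy_df.loc[is_sunday, "new_cases"].mean()
other_mean = toy_df.loc[~is_sunday, "new_cases"].mean()  # ~ negates the mask
overall_mean = toy_df["new_cases"].mean()

print(sunday_mean, other_mean, overall_mean)
```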

## Grouping and aggregation

As a next step, we might want to summarize the day-wise data and create a new dataframe with month-wise data. We can use the `groupby` function to create a group for each month, select the columns we wish to aggregate, and aggregate them using the `sum` method. 

#### TODO:

Group new_cases, new_deaths, and new_tests by month

In [23]:
covid_month_df = covid_df.groupby('month')[['new_cases', 'new_deaths', 'new_tests']].sum()

In [24]:
covid_month_df

month,new_cases,new_deaths,new_tests
2,885.0,21.0,0.0
3,100851.0,11570.0,0.0
4,101852.0,16091.0,419591.0
5,29073.0,5658.0,1078720.0
6,8217.5,1404.0,830354.0
7,6722.0,388.0,797692.0
8,21060.0,345.0,1098704.0
9,3297.0,20.0,54395.0


The result is a new data frame that uses unique values from the column passed to `groupby` as the index. Grouping and aggregation is a powerful method for progressively summarizing data into smaller data frames.

Instead of aggregating by sum, you can also aggregate by other measures like mean. 
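
Several aggregates can also be computed in one pass with `agg`. The sketch below uses named aggregation on made-up data; the output column names `total_cases` and `mean_cases` are chosen here for illustration and are not part of this notebook's pipeline.

```python
import pandas as pd

# Made-up data for illustration only.
toy_df = pd.DataFrame({
    "month": [2, 2, 3, 3, 3],
    "new_cases": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Named aggregation: output_column=(input_column, aggregation_function).
monthly = toy_df.groupby("month").agg(
    total_cases=("new_cases", "sum"),
    mean_cases=("new_cases", "mean"),
)

print(monthly)
```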

#### TODO:

Let's compute the average number of daily new cases, deaths, and tests for each month.

In [25]:
covid_month_mean_df = covid_df.groupby('month')[['new_cases', 'new_deaths', 'new_tests']].mean()

In [26]:
covid_month_mean_df

month,new_cases,new_deaths,new_tests
2,110.625,2.625,
3,3253.258065,373.225806,
4,3395.066667,536.366667,38144.636364
5,937.83871,182.516129,34797.419355
6,273.916667,46.8,27678.466667
7,216.83871,12.516129,25732.0
8,679.354839,11.129032,35442.064516
9,1099.0,6.666667,54395.0


Apart from grouping, another form of aggregation is the running or cumulative sum of cases, tests, or deaths up to each row's date. We can use the `cumsum` method to compute the cumulative sum of a column as a new series. Let's add three new columns: `total_cases`, `total_deaths`, and `total_tests`.

#### TODO:

Do the cumulative sum for new_cases, new_deaths, and new_tests and store the values into three new columns (total_cases, total_deaths, and total_tests)

In [27]:
# Remember that the initial tests are the following

initial_tests = 935310

In [28]:
covid_df

Unnamed: 0,date,new_cases,new_deaths,new_tests,year,month,day,weekday
0,2020-02-22,14.0,0.0,,2020,2,22,5
1,2020-02-23,62.0,2.0,,2020,2,23,6
2,2020-02-24,53.0,0.0,,2020,2,24,0
3,2020-02-25,97.0,4.0,,2020,2,25,1
4,2020-02-26,93.0,5.0,,2020,2,26,2
...,...,...,...,...,...,...,...,...
190,2020-08-30,1444.0,1.0,53541.0,2020,8,30,6
191,2020-08-31,1365.0,4.0,42583.0,2020,8,31,0
192,2020-09-01,996.0,6.0,54395.0,2020,9,1,1
193,2020-09-02,975.0,8.0,,2020,9,2,2


In [29]:
covid_df['total_cases'] = covid_df.new_cases.cumsum()

In [30]:
covid_df['total_deaths'] = covid_df.new_deaths.cumsum()

In [31]:
covid_df['total_tests'] = covid_df.new_tests.cumsum() + initial_tests

In [32]:
covid_df

Unnamed: 0,date,new_cases,new_deaths,new_tests,year,month,day,weekday,total_cases,total_deaths,total_tests
0,2020-02-22,14.0,0.0,,2020,2,22,5,14.0,0.0,
1,2020-02-23,62.0,2.0,,2020,2,23,6,76.0,2.0,
2,2020-02-24,53.0,0.0,,2020,2,24,0,129.0,2.0,
3,2020-02-25,97.0,4.0,,2020,2,25,1,226.0,6.0,
4,2020-02-26,93.0,5.0,,2020,2,26,2,319.0,11.0,
...,...,...,...,...,...,...,...,...,...,...,...
190,2020-08-30,1444.0,1.0,53541.0,2020,8,30,6,267295.5,35473.0,5117788.0
191,2020-08-31,1365.0,4.0,42583.0,2020,8,31,0,268660.5,35477.0,5160371.0
192,2020-09-01,996.0,6.0,54395.0,2020,9,1,1,269656.5,35483.0,5214766.0
193,2020-09-02,975.0,8.0,,2020,9,2,2,270631.5,35491.0,


Notice how the `NaN` values in the `total_tests` column remain unaffected.
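
This follows from how `cumsum` treats missing values: with the default `skipna=True`, a `NaN` input produces a `NaN` output at that position, while the running total carries on across the gap. A small sketch on a made-up series (the offset of 100 stands in for `initial_tests`):

```python
import numpy as np
import pandas as pd

# Made-up series for illustration only.
s = pd.Series([np.nan, 5.0, np.nan, 7.0])

# NaN positions stay NaN; the running total continues past them.
running = s.cumsum()

# Adding a constant offset keeps the NaN positions as NaN.
shifted = running + 100

# Alternative (an assumption about what you may want): treat gaps as zero.
filled_running = s.fillna(0).cumsum()

print(running.tolist())
print(filled_running.tolist())
```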

## Merging data from multiple sources

To determine other metrics like tests per million, cases per million, etc., we require some more information about the country, viz. its population. Let's open another file `locations.csv` that contains health-related information for many countries, including Italy.

#### TODO:

Read the locations.csv file into a new dataframe and call it locations_df

In [33]:
locations_df = pd.read_csv('locations.csv')

#### TODO:

Find the entry for location=='Italy'

In [34]:
locations_df[locations_df.location=='Italy']

Unnamed: 0,location,continent,population,life_expectancy,hospital_beds_per_thousand,gdp_per_capita
97,Italy,Europe,60461828.0,83.51,3.18,35220.084


In [35]:
covid_df

Unnamed: 0,date,new_cases,new_deaths,new_tests,year,month,day,weekday,total_cases,total_deaths,total_tests
0,2020-02-22,14.0,0.0,,2020,2,22,5,14.0,0.0,
1,2020-02-23,62.0,2.0,,2020,2,23,6,76.0,2.0,
2,2020-02-24,53.0,0.0,,2020,2,24,0,129.0,2.0,
3,2020-02-25,97.0,4.0,,2020,2,25,1,226.0,6.0,
4,2020-02-26,93.0,5.0,,2020,2,26,2,319.0,11.0,
...,...,...,...,...,...,...,...,...,...,...,...
190,2020-08-30,1444.0,1.0,53541.0,2020,8,30,6,267295.5,35473.0,5117788.0
191,2020-08-31,1365.0,4.0,42583.0,2020,8,31,0,268660.5,35477.0,5160371.0
192,2020-09-01,996.0,6.0,54395.0,2020,9,1,1,269656.5,35483.0,5214766.0
193,2020-09-02,975.0,8.0,,2020,9,2,2,270631.5,35491.0,


In [36]:
locations_df[locations_df.location == "Italy"]

Unnamed: 0,location,continent,population,life_expectancy,hospital_beds_per_thousand,gdp_per_capita
97,Italy,Europe,60461828.0,83.51,3.18,35220.084


We can merge this data into our existing data frame by adding more columns. However, to merge two data frames, we need at least one common column. Let's insert a `location` column in the `covid_df` dataframe with all values set to `"Italy"`.

#### TODO:

Create a new column in the covid_df dataframe called 'location' and set the value to 'Italy'

In [37]:
covid_df['location'] = "Italy"

In [38]:
covid_df

Unnamed: 0,date,new_cases,new_deaths,new_tests,year,month,day,weekday,total_cases,total_deaths,total_tests,location
0,2020-02-22,14.0,0.0,,2020,2,22,5,14.0,0.0,,Italy
1,2020-02-23,62.0,2.0,,2020,2,23,6,76.0,2.0,,Italy
2,2020-02-24,53.0,0.0,,2020,2,24,0,129.0,2.0,,Italy
3,2020-02-25,97.0,4.0,,2020,2,25,1,226.0,6.0,,Italy
4,2020-02-26,93.0,5.0,,2020,2,26,2,319.0,11.0,,Italy
...,...,...,...,...,...,...,...,...,...,...,...,...
190,2020-08-30,1444.0,1.0,53541.0,2020,8,30,6,267295.5,35473.0,5117788.0,Italy
191,2020-08-31,1365.0,4.0,42583.0,2020,8,31,0,268660.5,35477.0,5160371.0,Italy
192,2020-09-01,996.0,6.0,54395.0,2020,9,1,1,269656.5,35483.0,5214766.0,Italy
193,2020-09-02,975.0,8.0,,2020,9,2,2,270631.5,35491.0,,Italy


We can now add the columns from `locations_df` into `covid_df` using the `.merge` method.

#### TODO:

Merge covid_df and locations_df on='location' into a new dataframe. Call the new dataframe merged_df

In [39]:
merged_df = covid_df.merge(locations_df, on="location")

In [40]:
merged_df

Unnamed: 0,date,new_cases,new_deaths,new_tests,year,month,day,weekday,total_cases,total_deaths,total_tests,location,continent,population,life_expectancy,hospital_beds_per_thousand,gdp_per_capita
0,2020-02-22,14.0,0.0,,2020,2,22,5,14.0,0.0,,Italy,Europe,60461828.0,83.51,3.18,35220.084
1,2020-02-23,62.0,2.0,,2020,2,23,6,76.0,2.0,,Italy,Europe,60461828.0,83.51,3.18,35220.084
2,2020-02-24,53.0,0.0,,2020,2,24,0,129.0,2.0,,Italy,Europe,60461828.0,83.51,3.18,35220.084
3,2020-02-25,97.0,4.0,,2020,2,25,1,226.0,6.0,,Italy,Europe,60461828.0,83.51,3.18,35220.084
4,2020-02-26,93.0,5.0,,2020,2,26,2,319.0,11.0,,Italy,Europe,60461828.0,83.51,3.18,35220.084
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
190,2020-08-30,1444.0,1.0,53541.0,2020,8,30,6,267295.5,35473.0,5117788.0,Italy,Europe,60461828.0,83.51,3.18,35220.084
191,2020-08-31,1365.0,4.0,42583.0,2020,8,31,0,268660.5,35477.0,5160371.0,Italy,Europe,60461828.0,83.51,3.18,35220.084
192,2020-09-01,996.0,6.0,54395.0,2020,9,1,1,269656.5,35483.0,5214766.0,Italy,Europe,60461828.0,83.51,3.18,35220.084
193,2020-09-02,975.0,8.0,,2020,9,2,2,270631.5,35491.0,,Italy,Europe,60461828.0,83.51,3.18,35220.084


The location data for Italy is appended to each row within `covid_df`. If the `covid_df` data frame contained data for multiple locations, then the respective country's location data would be appended for each row.
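
The multi-location case can be sketched with two made-up frames. The locations `A`, `B`, `C` and their populations are placeholders, not real data. `how="left"` is an optional argument not used in this notebook; it keeps every row of the left frame even if a location has no match.

```python
import pandas as pd

# Made-up daily data for two placeholder locations.
daily = pd.DataFrame({
    "location": ["A", "A", "B"],
    "new_cases": [14.0, 62.0, 5.0],
})

# Made-up lookup table with one row per location.
places = pd.DataFrame({
    "location": ["A", "B", "C"],
    "population": [1000.0, 2000.0, 3000.0],
})

# Each daily row receives the columns of its matching location.
merged = daily.merge(places, on="location", how="left")

print(merged)
```

Location `C` has no daily rows, so it does not appear in the result.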

We can now calculate metrics like cases per million, deaths per million, and tests per million.

#### TODO:

Calculate the cases per million

In [41]:
merged_df['cases_per_million'] = merged_df.total_cases * 1e6 / merged_df.population

#### TODO:

Calculate the deaths per million

In [42]:
merged_df['deaths_per_million'] = merged_df.total_deaths * 1e6 / merged_df.population

#### TODO:

Calculate the tests per million

In [43]:
merged_df['tests_per_million'] = merged_df.total_tests * 1e6 / merged_df.population
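
All three metrics use the same formula: the running total multiplied by one million and divided by the population. As a sanity check, the sketch below applies it to the first row shown above (14 total cases, population 60461828). The helper `per_million` is a hypothetical name, not something defined in this notebook.

```python
def per_million(total, population):
    """Scale a running total to a rate per one million inhabitants."""
    return total * 1e6 / population


# Values taken from the first row of merged_df shown above.
first_day_rate = per_million(14.0, 60461828.0)

print(round(first_day_rate, 6))
```

The same function works element-wise on pandas series, since it only uses multiplication and division.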

In [44]:
merged_df

Unnamed: 0,date,new_cases,new_deaths,new_tests,year,month,day,weekday,total_cases,total_deaths,total_tests,location,continent,population,life_expectancy,hospital_beds_per_thousand,gdp_per_capita,cases_per_million,deaths_per_million,tests_per_million
0,2020-02-22,14.0,0.0,,2020,2,22,5,14.0,0.0,,Italy,Europe,60461828.0,83.51,3.18,35220.084,0.231551,0.000000,
1,2020-02-23,62.0,2.0,,2020,2,23,6,76.0,2.0,,Italy,Europe,60461828.0,83.51,3.18,35220.084,1.256991,0.033079,
2,2020-02-24,53.0,0.0,,2020,2,24,0,129.0,2.0,,Italy,Europe,60461828.0,83.51,3.18,35220.084,2.133578,0.033079,
3,2020-02-25,97.0,4.0,,2020,2,25,1,226.0,6.0,,Italy,Europe,60461828.0,83.51,3.18,35220.084,3.737896,0.099236,
4,2020-02-26,93.0,5.0,,2020,2,26,2,319.0,11.0,,Italy,Europe,60461828.0,83.51,3.18,35220.084,5.276056,0.181933,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
190,2020-08-30,1444.0,1.0,53541.0,2020,8,30,6,267295.5,35473.0,5117788.0,Italy,Europe,60461828.0,83.51,3.18,35220.084,4420.896768,586.700753,84644.943252
191,2020-08-31,1365.0,4.0,42583.0,2020,8,31,0,268660.5,35477.0,5160371.0,Italy,Europe,60461828.0,83.51,3.18,35220.084,4443.472996,586.766910,85349.238862
192,2020-09-01,996.0,6.0,54395.0,2020,9,1,1,269656.5,35483.0,5214766.0,Italy,Europe,60461828.0,83.51,3.18,35220.084,4459.946199,586.866146,86248.897403
193,2020-09-02,975.0,8.0,,2020,9,2,2,270631.5,35491.0,,Italy,Europe,60461828.0,83.51,3.18,35220.084,4476.072076,586.998461,


## Writing data back to files

After completing your analysis and adding new columns, you should write the results back to a file. Otherwise, the data will be lost when the Jupyter notebook shuts down. Before writing to file, let us first create a data frame containing just the columns we wish to record.

#### TODO:

Save the dataframe to a CSV file. Call the file results.csv

In [45]:
result_df = merged_df[['date',
                       'new_cases', 
                       'total_cases', 
                       'new_deaths', 
                       'total_deaths', 
                       'new_tests', 
                       'total_tests', 
                       'cases_per_million', 
                       'deaths_per_million', 
                       'tests_per_million']]

In [46]:
result_df

Unnamed: 0,date,new_cases,total_cases,new_deaths,total_deaths,new_tests,total_tests,cases_per_million,deaths_per_million,tests_per_million
0,2020-02-22,14.0,14.0,0.0,0.0,,,0.231551,0.000000,
1,2020-02-23,62.0,76.0,2.0,2.0,,,1.256991,0.033079,
2,2020-02-24,53.0,129.0,0.0,2.0,,,2.133578,0.033079,
3,2020-02-25,97.0,226.0,4.0,6.0,,,3.737896,0.099236,
4,2020-02-26,93.0,319.0,5.0,11.0,,,5.276056,0.181933,
...,...,...,...,...,...,...,...,...,...,...
190,2020-08-30,1444.0,267295.5,1.0,35473.0,53541.0,5117788.0,4420.896768,586.700753,84644.943252
191,2020-08-31,1365.0,268660.5,4.0,35477.0,42583.0,5160371.0,4443.472996,586.766910,85349.238862
192,2020-09-01,996.0,269656.5,6.0,35483.0,54395.0,5214766.0,4459.946199,586.866146,86248.897403
193,2020-09-02,975.0,270631.5,8.0,35491.0,,,4476.072076,586.998461,


To write the data from the data frame into a file, we can use the `to_csv` function. 

In [47]:
result_df.to_csv('results.csv', index=False)

By default, `to_csv` also writes the dataframe's index as an extra column. We pass `index=False` to turn off this behavior. You can now verify that `results.csv` has been created and contains the data from the data frame in CSV format:

```
date,new_cases,total_cases,new_deaths,total_deaths,new_tests,total_tests,cases_per_million,deaths_per_million,tests_per_million
2020-02-27,78.0,400.0,1.0,12.0,,,6.61574439992122,0.1984723319976366,
2020-02-28,250.0,650.0,5.0,17.0,,,10.750584649871982,0.28116913699665186,
2020-02-29,238.0,888.0,4.0,21.0,,,14.686952567825108,0.34732658099586405,
2020-03-01,240.0,1128.0,8.0,29.0,,,18.656399207777838,0.47964146899428844,
2020-03-02,561.0,1689.0,6.0,35.0,,,27.93498072866735,0.5788776349931067,
2020-03-03,347.0,2036.0,17.0,52.0,,,33.67413899559901,0.8600467719897585,
...
```
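
To confirm that a file written this way can be loaded again, you can read it back with `read_csv`. The sketch below is self-contained: it writes a two-row frame (values taken from the sample above) to a temporary directory rather than touching `results.csv`, and the file name `results_demo.csv` is hypothetical. `parse_dates` asks `read_csv` to convert the `date` column back to datetimes.

```python
import os
import tempfile

import pandas as pd

# Two rows taken from the sample output above.
toy_df = pd.DataFrame({
    "date": pd.to_datetime(["2020-02-27", "2020-02-28"]),
    "new_cases": [78.0, 250.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "results_demo.csv")  # hypothetical file name

    # Write without the index column, then read the file back.
    toy_df.to_csv(path, index=False)
    restored = pd.read_csv(path, parse_dates=["date"])

print(restored)
print(restored.dtypes)
```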