#### The individual case data (as opposed to the "running totals by country/state/etc") are updated daily at https://github.com/beoutbreakprepared/nCoV2019/tree/master/latest_data 

#### There's a detailed explanation of these data, and a full list of authors, at https://www.nature.com/articles/s41597-020-0448-0.pdf

#### My process for using these data goes as follows:

### Step 1 --- Download the latest data in csv format from the first link shown above

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

%matplotlib inline
pd.set_option('display.max_columns', None)

In [2]:
updated = pd.read_csv('https://github.com/beoutbreakprepared/nCoV2019/raw/master/latest_data/latestdata.csv')
print(f'Shape: {updated.shape}')
updated.head(3)

Shape: (131260, 34)




Unnamed: 0,ID,age,sex,city,province,country,wuhan(0)_not_wuhan(1),latitude,longitude,geo_resolution,date_onset_symptoms,date_admission_hospital,date_confirmation,symptoms,lives_in_Wuhan,travel_history_dates,travel_history_location,reported_market_exposure,additional_information,chronic_disease_binary,chronic_disease,source,sequence_available,outcome,date_death_or_discharge,notes_for_discussion,location,admin3,admin2,admin1,country_new,admin_id,data_moderator_initials,travel_history_binary
0,,,,,Tucuman,Argentina,,-26.94,-65.34,admin1,,,27.03.2020,,,,,,,,,https://www.argentina.gob.ar/sites/default/fil...,,,,,,,,,,,,
1,,,,,Santa Fe,Argentina,,-33.7227,-62.246,admin1,,,27.03.2020,,,,,,,,,https://www.argentina.gob.ar/sites/default/fil...,,,,,,,,,,,,
2,,,,,Santa Fe,Argentina,,-33.7227,-62.246,admin1,,,27.03.2020,,,,,,,,,https://www.argentina.gob.ar/sites/default/fil...,,,,,,,,,,,,


That's a lot of NaN's.  What's actually not NaN in there?

In [4]:
updated.head(3).dropna(axis=1)

Unnamed: 0,province,country,latitude,longitude,geo_resolution,date_confirmation,source
0,Tucuman,Argentina,-26.94,-65.34,admin1,27.03.2020,https://www.argentina.gob.ar/sites/default/fil...
1,Santa Fe,Argentina,-33.7227,-62.246,admin1,27.03.2020,https://www.argentina.gob.ar/sites/default/fil...
2,Santa Fe,Argentina,-33.7227,-62.246,admin1,27.03.2020,https://www.argentina.gob.ar/sites/default/fil...


In these most recent cases (27 March confirmation for the above 3 cases), the details available are limited to where the patient is, when the diagnosis of COVID-19 was confirmed, and where the authors found this info.  There are simply too many active cases of the disease by now for the authors to be able to fill in the NaN's for most of the columns.  The source for the first line, for example, is this: 

In [7]:
source = updated.loc[0, 'source']
source

'https://www.argentina.gob.ar/sites/default/files/27-03-20-reporte-diario-vespertino-covid-19.pdf'

In [None]:
# since the first line will change every time this notebook imports the data, I'm going to hard code the source
source = 'https://www.argentina.gob.ar/sites/default/files/27-03-20-reporte-diario-vespertino-covid-19.pdf'

In [46]:
from IPython.display import IFrame
IFrame(source, width=800, height=400)

This daily report from Argentina's Ministry of Health mentions the location of the first line's case, the province of Tucumán, only to note that 6 of the 101 new cases confirmed or reclassified in Argentina that day were in Tucumán.

In [15]:
tucuman27mar = updated[(updated.province == 'Tucuman') & (updated.date_confirmation == '27.03.2020')].dropna(axis=1)
tucuman27mar

Unnamed: 0,province,country,latitude,longitude,geo_resolution,date_confirmation,source
0,Tucuman,Argentina,-26.94,-65.34,admin1,27.03.2020,https://www.argentina.gob.ar/sites/default/fil...
84,Tucuman,Argentina,-26.94,-65.34,admin1,27.03.2020,https://www.argentina.gob.ar/sites/default/fil...
86,Tucuman,Argentina,-26.94,-65.34,admin1,27.03.2020,https://www.argentina.gob.ar/sites/default/fil...
109,Tucuman,Argentina,-26.94,-65.34,admin1,27.03.2020,https://www.argentina.gob.ar/sites/default/fil...
110,Tucuman,Argentina,-26.94,-65.34,admin1,27.03.2020,https://www.argentina.gob.ar/sites/default/fil...
113,Tucuman,Argentina,-26.94,-65.34,admin1,27.03.2020,https://www.argentina.gob.ar/sites/default/fil...


In [16]:
# Maybe there's more info?  Unfortunately not
tucuman27mar.source.nunique()

1

So at this stage of the pandemic, the curators of these data seem to be just trying to keep their heads above water amid the deluge of incoming cases.  From a single source that simply reports "6 new cases in Tucumán on 27 March", they have created a separate line for each case, scattered across the first 114 rows of the frame, with identical features for all 6 cases.  If we want a meaningful level of detail, we need to look back in time, to when these details were being reported and investigated as they occurred.
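One way to see where the detailed rows are is to count the non-null fields in each row and average that count by confirmation date. The following is a minimal sketch, assuming the `date_confirmation` strings follow the `dd.mm.yyyy` pattern seen above. The `detail_by_date` helper and the toy frame are made up for illustration and are not part of the original notebook.

```python
import pandas as pd

def detail_by_date(df, date_col="date_confirmation"):
    """Mean number of non-null fields per row, grouped by confirmation date."""
    # Dates here look like '27.03.2020' (day.month.year); anything that
    # does not match becomes NaT and drops out of the grouping.
    dates = pd.to_datetime(df[date_col], format="%d.%m.%Y", errors="coerce")
    filled = df.notna().sum(axis=1)  # non-null fields in each row
    return filled.groupby(dates).mean().sort_index()

# Tiny made-up frame, just to show the shape of the result
toy = pd.DataFrame({
    "date_confirmation": ["15.01.2020", "27.03.2020", "27.03.2020"],
    "age": ["30", None, None],
    "symptoms": ["cough, fever", None, None],
})
detail = detail_by_date(toy)
print(detail)
```

Applied to the full frame (`detail_by_date(updated)`), dates with a high average would be the ones worth reading closely.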

In [45]:
# Try 15th of January
jan15 = updated[updated.date_confirmation == '15.01.2020']
jan15.dropna(axis=1)

Unnamed: 0,ID,age,sex,province,country,wuhan(0)_not_wuhan(1),latitude,longitude,geo_resolution,date_onset_symptoms,date_admission_hospital,date_confirmation,symptoms,lives_in_Wuhan,travel_history_dates,travel_history_location,reported_market_exposure,source,outcome,date_death_or_discharge,notes_for_discussion,admin1,country_new,admin_id
40142,000-1-649,30,male,Tokyo,Japan,1.0,35.71145,139.4468,admin1,03.01.2020,10.01.2020,15.01.2020,"cough, fever, sore throat",no,06.01.2020,Wuhan,no,https://www.who.int/csr/don/17-january-2020-no...,discharged,15.01.2020,https://www.mhlw.go.jp/stf/newpage_08906.html,Tokyo,Japan,41


This line has a lot of useful details.  The patient's age and sex are recorded, which is useful both for analyzing those features and for identifying the case.  In addition to the date of confirmation of diagnosis, we also get a date of travel (6 Jan), the location of travel (Wuhan), the date of symptom onset (3 Jan), the date of hospital admission (10 Jan), details about the outcome of the case (discharged on 15 Jan, the same day as confirmation of the disease, interestingly), and a short list of symptoms.  In addition to a World Health Organization source, there seems to be another URL, for the Japanese Ministry of Health, listed in the 'notes_for_discussion' column.  That's encouraging, because anyone trying to analyze this line of data will probably wonder whether the patient traveled from Wuhan to Tokyo, or the other way around, or maybe round trip, on the travel date shown.  And where was the patient when he was infected with the disease, Wuhan or Tokyo?  And when he traveled, what stage was his disease at?

In [47]:
source = updated.loc[40142, 'source']
source

'https://www.who.int/csr/don/17-january-2020-novel-coronavirus-japan-ex-china/en/'

In [48]:
IFrame(source, width=800, height=400)

This is helpful.  We now know the patient very likely contracted the virus in late December in Wuhan, perhaps 5 days before the onset of his fever on Jan. 3rd.  We also see that the patient flew home to Japan on Jan. 6th, which means there's a good chance he spread the disease to other travelers in both countries.  Furthermore, we see that the man's age, as shown in our data, is probably not correct, since the W.H.O. report shows him to be 30-39 years old, not 30 years old.  If we load the Japanese webpage in the "notes_for_discussion" column of our data, we see that Google translates it fairly quickly into English, and that it lists the same 30-39 age range for the man.  But it also provides the useful detail that the man lives in Kanagawa Prefecture, not Tokyo.  We're told that he went home after visiting a clinic on Jan. 6th, and regardless of whether that clinic was in Tokyo or Kanagawa, the patient traveled between Tokyo and Kanagawa at least once between 6-10 Jan. before checking into a hospital on the 10th.

### Step 2 --- Pore over the data sources and build a dataframe that has one event per line

It's easy to see that the aggregating and cleaning of data for this project is going to be a highly manual process, involving lots of cross-checking, scouring webpages, and hoping the details make sense.  No amount of NLP or automation would make it possible to decipher what we did in the Japan case above by looking at the 2 sources provided.  Inherently, this stage of the project reduces to doing what half the world is doing these days:  reading hundreds of online articles about COVID-19.

While the Japan case is still fresh in your memory, I'm going to briefly fast-forward to how it will look after this "Step 2".

In [216]:
events[events.Old_ID == '649']

Unnamed: 0,Old_ID,ID,AgeSex,Day,Went_From,Went_To,Status,Got_From,Got_At,From_geo,To_geo
389,649,96.0,(35)M,"(-12, -3, 3)",,,1.0,,Wuhan,,
390,649,96.0,(35)M,3,,,2.0,,,,
391,649,96.0,(35)M,6,Wuhan,Tokyo,2.0,,,"(30.58, 114.27)","(35.685, 139.7514)"
392,649,96.0,(35)M,6,Tokyo,Kanagawa,2.0,,,"(35.685, 139.7514)","(35.382, 139.217)"
393,649,96.0,(35)M,10,,,3.0,,,,


- One line has become 5 lines, 1 for each event.  The "Status" column indicates the progression of the disease:
    - 1.0 means the patient is infected.  So the row with the first 1.0 for each patient indicates when and where the patient got infected.  
    - 2.0 means the patient has symptoms.  So the first 2.0 row indicates when the patient had onset of symptoms.  
    - 3.0 means the patient entered the hospital, and thus the spread of his germs is hopefully contained (although unfortunately many cases spread from patients to medical personnel).  
    - Beyond these three lines that indicate change in status, the remaining lines show when and where the patient traveled.  In the Japan example, I can't tell when the patient moved from Tokyo to Kanagawa and maybe back to Tokyo again to the hospital, because the sources don't specify, but I can be pretty sure he at least traveled from Tokyo to Kanagawa on 6 Jan, whether he went to a clinic in Tokyo or in his hometown.
    
- The patient's age is now given as (35), meaning somewhere between 30 and 39.

- The days are now represented as integers, since I find that much easier to work with and grasp. Day 1 is 1 Jan. 2020, and day -1 is 31 Dec. 2019.  Just as the year 1 A.D. directly follows the year 1 B.C., there's no "day 0" here.  14 Feb. is day 45, and 27 Mar. is day 87, for example.  In all cases, I estimated the day of infection as a tuple of 3 days: earliest possible, most likely, and latest possible.  In the Japan example, this tuple is (-12, -3, 3).  The latest possible day is 3 because that was the stated day of symptom onset.  The best-guess day is -3 because that's 5 days before day 3 (remember, no day 0 here), and 5 days is the approximate mean incubation period (the time from infection to symptom onset) calculated by epidemiologists.  The earliest possible day is -12, which is 14 days before day 3, because according to medical experts more than 95% of cases have an incubation period shorter than 14 days. Could this patient have been infected before day -12 (20 Dec.)?  Yes, possibly, but the exact date isn't important to my aims here, which is good, because no one could rightly claim to know with 100% certainty when someone got infected.  If the source had stated "The patient arrived in Wuhan on 28 Dec.", rather than stating that the patient went to Wuhan in late December, I'd have used (-4, -3, 3) as the tuple.  And if the source showed that the patient went to Wuhan on 1 Jan., I'd have used (1, 1, 3), although in that case I'd have searched further to see if there was other information about the case.  2 days is not prohibitively short for an incubation period, but it's unlikely enough that I'd hesitate to believe it outright.  

- When it seems safe to call where the patient contracted the disease, I put that in the "Got_At" column of the first row with a Status of 1.  This disease has spread a lot on trains and planes, so in a few cases I listed start and end points for the patient's journey, if that seemed like the most likely place of contagion. And if there was another patient who was known to pass the disease to this patient, I put that in a "Got_From" column in the same row, using the other patient's ID, in case I or someone else can use that feature for analysis.  

- The geo location lat/lons shown are taken from simplemaps.com using the place names in the "Went_From" and "Went_To" columns of the respective rows as lookup keys.  To give you an idea of how simplemaps.com's data deals with ambiguities, their geo point for Kanagawa in the current case resolves to an arbitrary point in the city of Hadano, which is a medium-sized city in the Kanagawa Prefecture, nowhere near as large as the capital, Yokohama, or as Kawasaki.  Many locations had to be hand-coded, either because they weren't listed in simplemaps.com, or because the sources only specified a region or country instead of a city, or because I wanted to specify that the location was an airport.  The latter use case was when patients were transiting through an airport or when they were quarantined upon arrival at an airport.  If they were evacuated from Wuhan by their government, I didn't include them in this project, because they were presumably prevented from infecting anyone else on their planes and on their way to the hospital or quarantine location at home.  But if they flew into an airport and were stopped there by effective disease control measures, I included them and used the airport as the endpoint of their spread of the disease.  My reasoning was that they probably spread the disease to others on the plane, and all those other affected/infected passengers were then hopefully contacted later by authorities and told to either self-quarantine or self-monitor, but only after having possibly further spread the disease.  In some cases, the sources could only specify that the patient "traveled to Michigan, Illinois, and Ohio", for example.  My first reaction was to discard this type of case, but since those states are in the same region of the world, if the other elements of the case were unique or interesting enough to warrant keeping the case, I estimated a geo point to specify (Chicago, the largest and most-infected city in that region, in this particular example). 
 In such cases, I made a note in the location lookup dict in the code.  
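The day numbering and the infection-date tuple described in the list above can be sketched in code. This is only an illustration of the arithmetic, not the process used to build the spreadsheet (those entries were typed by hand). The helper names `to_day`, `shift_day`, and `infection_window` are made up for this example, and the 5-day and 14-day defaults are the figures quoted above.

```python
from datetime import date

DAY_ONE = date(2020, 1, 1)  # day 1 in the scheme described above

def to_day(d):
    """Convert a calendar date to the day number used here (there is no day 0)."""
    offset = (d - DAY_ONE).days
    return offset + 1 if offset >= 0 else offset

def shift_day(day, n):
    """Move n days forward (n > 0) or back (n < 0), skipping the missing day 0."""
    # Map onto a continuous axis (day 1 -> 0, day -1 -> -1), shift, map back
    offset = day - 1 if day > 0 else day
    offset += n
    return offset + 1 if offset >= 0 else offset

def infection_window(onset_day, earliest_exposure=None,
                     incubation=5, max_incubation=14):
    """Return (earliest, most likely, latest) infection day for a symptom onset day."""
    latest = onset_day
    earliest = shift_day(onset_day, -max_incubation)
    likely = shift_day(onset_day, -incubation)
    if earliest_exposure is not None:
        # The patient cannot have been infected before first being exposed
        earliest = max(earliest, earliest_exposure)
        likely = max(likely, earliest)
    return (earliest, likely, latest)

print(to_day(date(2020, 2, 14)))                  # 45
print(to_day(date(2019, 12, 20)))                 # -12
print(infection_window(3))                        # (-12, -3, 3)
print(infection_window(3, earliest_exposure=-4))  # (-4, -3, 3)
print(infection_window(3, earliest_exposure=1))   # (1, 1, 3)
```

The three `infection_window` calls reproduce the tuples from the Japan example and its two hypothetical variants.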

After the intensive hours needed to enter each case by hand into a spreadsheet (I used Apple's Pages application), the resulting csv can be saved and processed as it's updated by going through the following routine:

In [218]:
events = pd.read_csv('Events.csv')
events.shape

(1402, 9)

In [219]:
# citation for placename ==> lat/lon mapping:  https://simplemaps.com/data/world-cities
latlons = pd.read_csv('worldcities.csv')
latlons.columns

Index(['city', 'city_ascii', 'lat', 'lng', 'country', 'iso2', 'iso3',
       'admin_name', 'capital', 'population', 'id'],
      dtype='object')

In [220]:
loc_dict = {city: (lat, lon) for city, lat, lon in zip(latlons.city_ascii,
                                                       latlons.lat, 
                                                       latlons.lng)}
loc_dict['Moscow']

(46.7307, -116.9986)

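Unfortunately, because this dict is keyed on city name alone, a name shared by several cities ends up pointing at whichever one appears last in the file, so common European city names can land on their American namesakes (Moscow here resolves to Moscow, Idaho).  We'll have to hand code places like Moscow, Paris, Venice, Rome, London, etc.   This is a small time cost.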
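An alternative to hand coding the well-known cities would be to keep, for each name, the entry with the largest population. This is a sketch of that idea rather than what this notebook does. It assumes the `population` column is filled in for the cities that matter, and the `build_loc_dict` helper and toy rows (with invented round populations) are for illustration only.

```python
import pandas as pd

def build_loc_dict(cities):
    """Map each city name to the (lat, lng) of its most populous namesake."""
    # Sort ascending by population so the largest city comes last, then keep
    # only that last row for each name. Rows with no population sort first.
    biggest = (cities.sort_values("population", na_position="first")
                     .drop_duplicates("city_ascii", keep="last"))
    return {city: (lat, lng) for city, lat, lng in
            zip(biggest.city_ascii, biggest.lat, biggest.lng)}

# Made-up rows in the same column layout as worldcities.csv
toy = pd.DataFrame({
    "city_ascii": ["Moscow", "Moscow", "Wuhan"],
    "lat": [55.76, 46.73, 30.58],
    "lng": [37.62, -117.0, 114.27],
    "population": [10_000_000, 25_000, 8_000_000],
})
toy_dict = build_loc_dict(toy)
print(toy_dict["Moscow"])
```

This would not remove the need for hand coding entirely, since regions, countries, and airports still have no row in the city table.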

In [221]:
events.head()

Unnamed: 0.1,Unnamed: 0,ID,AgeSex,Day,Went_From,Went_To,Status,Got_From,Got_At
0,5.0,0.0,50F,6,Hefei,Wuhan,0.0,,
1,,0.0,,"(6, 6, 7)",,,1.0,,Wuhan
2,,0.0,,7,Wuhan,Hefei,1.0,,
3,,0.0,,10,,,2.0,,
4,,0.0,,21,,,3.0,,


The "Unnamed: 0" column is the ID from the source data, which I typed only once per case, as I did with AgeSex, so both columns need to be forward filled to remove the NaN's.

In [222]:
events['Unnamed: 0'] = events['Unnamed: 0'].ffill()
events['AgeSex'] = events['AgeSex'].ffill()

In [223]:
# rename the unnamed column
events.rename(columns={'Unnamed: 0': 'Old_ID'}, inplace=True)

In [224]:
# manually drop a pin on a google map to fill in missing geo points
add_places = {'Chaohu':(31.632, 117.882),
              'Yinzhou':(29.808, 121.561),
              'Chizhou':(30.657, 117.491),
              '(Hunan, Jiangxi)':(27.884, 113.819), # arbitrary point, sort of near the middle of the border
              'Baise':(23.893, 106.615),
              'Hechi':(24.693, 108.085),
              'Fangchenggang':(21.686, 108.355),
              'Thailand':(13.74, 100.517), # Bangkok
              'WUH':(30.777, 114.216),
              'Beijing West':(39.894, 116.325),
              'Zhongwei':(37.516, 105.189),
              "Xi'an":(34.374, 108.939),
              'Zhoushan':(29.987, 122.202),
              'Rushan':(36.905, 121.549),
              'Dazhou':(31.203, 107.461),
              'Xishuangbanna':(22.014, 100.796),
              'Xincun':(32.861, 115.519),
              'Bozhou':(33.836, 115.775),
              'Lingao':(19.91, 109.683),
              'ICN':(37.45, 126.451),
              'GMP':(37.559, 126.803),
              'TPE':(25.08, 121.235),
              'BKK':(13.692, 100.752),
              'SEA':(47.444, -122.303),
              'Ho Chi Minh':(10.836, 106.619),
              'China':(loc_dict['Wuhan']),
              "Ma'anshan":(31.673, 118.499),
              'Jingzhou':(30.355, 112.192),
              'QDD':(36.266, 120.382),
              'CDG':(49.01, 2.547),
              'Switzerland':(46.96, 7.45),  # Bern, for a trip from Lombardy to Paris
              'Taiwan':(25.03, 121.58),
              'Sri Lanka':(6.915, 79.97),
              'Pyeongtaek':(37.01, 126.98),
              'Hokkaido':(43.085, 141.333),
              'Puyang':(35.77, 115.09),
              'Lujiang':(31.253, 117.307),
              'Xiaogan':(30.913, 113.973),
              'Baoji':(34.352, 107.241),
              "Yan'an":(36.597, 109.49),
              'Ivalo Village':(68.658, 27.54),
              'Kanagawa':(35.382, 139.217),
              'HKG':(22.313, 113.928),
              'CEB':(10.312, 123.98),
              'DGT':(9.33, 123.296),
              'HND':(35.55, 139.78),
              'Mie':(34.567, 136.56),
              'SIN':(1.362, 103.99),
              'Lijiang':(26.86,100.234),
              'Sanxiang':(22.36, 113.44),
              'Santa Clara Cty':(37.2, -121.69),
              'CTU':(30.572, 103.952),
              'Gunsan':(35.971, 126.727),
              'Japan':loc_dict['Tokyo'],
              'Gunpo':(37.342, 126.926),
              'Suizhou':(31.713, 113.37),
              'Qingyang':(35.694, 107.634),
              'Tianshui':(34.582, 105.722),
              'Yingcheng':(30.942, 113.572),
              'Sabah':(5.961, 116.072),
              'TSN':(39.126, 117.359),
              'Luxizhen':(27.634, 114.044),
              'Shangluo':(33.864, 109.93),
              'Cebu':(10.309, 123.895),
              'Bohol':(9.649, 123.852),
              'Xuanwei':(26.234, 104.108),
              'Italy':(41.89, 12.492),  # Rome Colosseum
              'Tongchuan':(34.908, 108.941),
              'Dingxi':(35.591, 104.623),
              'Baiyin':(36.542, 104.136),
              'Liupanshui':(26.589, 104.84),
              'Loudi':(27.7, 111.994),
              'Maui':(20.867, -156.478),  # largest population center
              'Linxia':(35.602, 103.21),
              'Iran':loc_dict['Tehran'],
              'Segovia':(40.941, -4.107),
              'Rome':(41.89, 12.492),
              'Venice':(45.437, 12.333),
              'Milan':(45.464, 9.19),
              'Florence':(43.78, 11.258),
              'Verona':(45.439, 10.995),
              'East Coast Demerara':(6.816, -58.064),
              'Ivalo':(68.658, 27.54),  # same as Ivalo Village, above
              'Andalo':(46.165, 11.003),
              'Paris':(48.853, 2.35),
              'Moscow':(55.76, 37.619),
              'Tel Aviv':(32.078, 34.774),
              'Heilbronn District':(49.151, 9.22),
              'South Tyrol':(46.495, 11.347),
              'Brandenburg':(52.412, 12.555),  # chose the region within the region.  Could just be Berlin
              'Trentino':(46.062, 11.127),
              'Hamburg':(53.541, 9.984),
              'Segeberg':(53.946, 10.207),
              'Ramat Gan':(32.069, 34.826),  # basically same as Tel Aviv, but they specified, so...
              'Uetze':(52.464, 10.203),
              'Freiburg':(53.826, 9.289),
              'Livigno':(46.539, 10.136),
              'Lombardy':(45.811, 9.084),  # Chose a point of tourism in the area
              'Wetzlar':(50.565, 8.504),
              'DIA':(25.265, 51.56),  # Doha airport, not Denver Int'l
              'Berkeley':(37.872, -122.262),
              'Sydney':(-33.861, 151.211),
              'Egypt':(30.047, 31.234), # Cairo
              'LAX':(33.941, -118.41),
              'Melbourne':(-37.799, 144.961),
              '(MI, IL, OH)':loc_dict['Chicago'],  # chose the biggest city, since this was a business trip
              'YUL':(45.455, -73.755),
              '(Rome, Milan)':(43.36, 11.267),  # chose an arbitrary point, halfway
              'KUL':(2.74, 101.685),
              'DPS':(-8.744, 115.166),
              'TAO':(36.267, 120.383),
              
             }

In [225]:
loc_dict.update(add_places)
loc_dict['Moscow']

(55.76, 37.619)

Excellent.  We've moved Moscow from Idaho back to its home in Russia.

In [226]:
from_geo = [loc_dict[place] if (place and place in loc_dict) else np.nan for place in events.Went_From]
to_geo = [loc_dict[place] if (place and place in loc_dict) else np.nan for place in events.Went_To]
events['From_geo'] = from_geo
events['To_geo'] = to_geo

In [227]:
# check that the travel locations all have lat/lons for both to and from locations
incomplete = np.logical_xor(events.From_geo.notnull(), events.To_geo.notnull())
events.iloc[np.where(incomplete)]

Unnamed: 0,Old_ID,ID,AgeSex,Day,Went_From,Went_To,Status,Got_From,Got_At,From_geo,To_geo
243,260,59.0,22M,17,Wuhan,Xi’an,1.0,,,"(30.58, 114.27)",
246,261,60.0,32F,12,Xi’an,Hangzhou,0.0,,,,"(30.25, 120.17)"
248,261,60.0,32F,14,Hangzhou,Xi’an,1.0,,,"(30.25, 120.17)",
360,601,89.0,64F,14,Wuhan,Xi’an,2.0,,,"(30.58, 114.27)",
491,883,121.0,43M,20,Wuhan,Ma’anshan,1.0,,,"(30.58, 114.27)",
496,884,122.0,48F,20,Wuhan,Ma’anshan,2.0,,,"(30.58, 114.27)",
499,885,123.0,54F,16,Wuhan,Ma’anshan,1.0,,,"(30.58, 114.27)",
503,886,124.0,47M,18,Wuhan,Ma’anshan,1.0,,,"(30.58, 114.27)",
507,887,125.0,55M,9,Wuhan,Ma’anshan,1.0,,,"(30.58, 114.27)",
511,888,126.0,47F,16,Wuhan,Ma’anshan,1.0,,,"(30.58, 114.27)",


Looks like the curly apostrophes (’) that came from Apple's Pages app, where I built the data, are causing trouble: the keys in the lookup dict use the plain ASCII apostrophe (').

In [228]:
apost = events.loc[243, 'Went_To'][2]  # the problem character
loc_dict['Xi' + apost + 'an'] = loc_dict["Xi'an"]
loc_dict['Yan' + apost + 'an'] = loc_dict["Yan'an"]
loc_dict['Ma' + apost + 'anshan'] = loc_dict["Ma'anshan"]
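Instead of adding extra dict keys, the place names themselves could be normalized before the lookup. This is a minimal sketch, assuming the problem character is the right single quotation mark (U+2019), which is what the output above appears to show. The `normalize_apostrophes` helper is hypothetical and not part of the original notebook.

```python
def normalize_apostrophes(name):
    """Replace typographic apostrophes with the plain ASCII one."""
    if not isinstance(name, str):
        return name  # leave NaN / missing values untouched
    return name.replace("\u2019", "'").replace("\u2018", "'")

print(normalize_apostrophes("Xi\u2019an"))      # Xi'an
print(normalize_apostrophes("Ma\u2019anshan"))  # Ma'anshan
```

On the events frame this could be applied with something like `events['Went_To'].map(normalize_apostrophes)` (and the same for `Went_From`), after which the original `loc_dict` keys would match.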

In [229]:
from_geo = [loc_dict[place] if (place and place in loc_dict) else np.nan for place in events["Went_From"]]
to_geo = [loc_dict[place] if (place and place in loc_dict) else np.nan for place in events["Went_To"]]
events['From_geo'] = from_geo
events['To_geo'] = to_geo

In [230]:
incomplete = np.logical_xor(events.From_geo.notnull(), events.To_geo.notnull())
events.iloc[np.where(incomplete)]

Unnamed: 0,Old_ID,ID,AgeSex,Day,Went_From,Went_To,Status,Got_From,Got_At,From_geo,To_geo


The logical XOR confirmed that every line with a "From_geo" also has a "To_geo", and vice versa.  But both could still be missing in a row where they should exist, so one more check:

In [231]:
incomplete = np.logical_xor(events.Went_To.notnull(), events.To_geo.notnull())
events.iloc[np.where(incomplete)]

Unnamed: 0,Old_ID,ID,AgeSex,Day,Went_From,Went_To,Status,Got_From,Got_At,From_geo,To_geo


#### Final Step -- Save this DataFrame so I can skip Step 2 when possible (i.e. when not using new data)
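The save format isn't shown here, so the following is only a sketch of one option. Because the geo columns hold Python tuples, a pickle round trip keeps them as tuples, whereas a CSV round trip would bring them back as strings. The filename `events_geo.pkl` and the toy frame are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Toy frame with a tuple-valued column, like From_geo / To_geo above
toy = pd.DataFrame({"Went_To": ["Tokyo"], "To_geo": [(35.685, 139.7514)]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "events_geo.pkl")  # hypothetical filename
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# The tuple survives the round trip unchanged
print(type(restored.loc[0, "To_geo"]))
```

For the real frame the equivalent would be `events.to_pickle(...)` followed later by `pd.read_pickle(...)`, with a path of your choosing.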