
# Steps

(1) Import the libraries.

(2) Get the JSON data from the API and normalize it. If it is a nested dictionary, pass the necessary arguments (for example `record_path`). The output should be a proper pandas DataFrame.

(3) Start inspecting the data: df.shape, df.info(), df.describe(), df.head(), df.isna().sum(), etc.

--> df.shape : Gives the size of the table, i.e. the number of instances (rows) and features (columns).<br>
--> df.info() : Very insightful. Lists all the columns with their data types and non-null counts, plus the memory usage. <br>
--> df.head() : Shows the first few rows of the df.
<br><br>
Use df.info() and df.head() together to decide whether the data type of any column should be changed. Typically quite a few need to change from 'object' to 'int64'/'float64'/'category'.

(4) Missing data isn't always in the form of NaN/Null. It might be an empty string, a blank space, a hyphen ('-') or something else entirely, in which case df.isna() will not flag it. The following commands help to spot such values:

--> pd.unique(df.column)  <br>
--> df.column.value_counts() <br>
--> df.column.value_counts()[value] <br>
--> df.column.value_counts(normalize=True) <br>

Alternatively, you may write a for loop over the column names and print value_counts() for each one to get a complete picture.
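As a rough sketch of that loop, the snippet below counts cells that are either NaN or a placeholder string in each column. The toy frame, the `PLACEHOLDERS` set and the helper name `placeholder_counts` are assumptions made for illustration; they are not part of the API data.

```python
import pandas as pd

# Hypothetical toy frame; real data would come from the API.
df = pd.DataFrame({
    "age": ["85", "", "-", "40"],
    "city": ["Mumbai", " ", "Pune", ""],
})

# Assumed markers for "missing"; extend this set after looking at value_counts().
PLACEHOLDERS = {"", "-"}

def placeholder_counts(frame):
    """Count, per column, the cells that are NaN or a placeholder string."""
    counts = {}
    for col in frame.columns:
        values = frame[col]
        # Strip whitespace so that ' ' is treated like ''.
        is_blank = values.astype(str).str.strip().isin(PLACEHOLDERS)
        counts[col] = int((values.isna() | is_blank).sum())
    return pd.Series(counts)

hidden = placeholder_counts(df)
print(hidden)
```

Comparing this with `df.isna().sum()` shows how much missing data would go unnoticed if only real nulls were counted.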
<br><br><br>
In this particular case, I took the COVID-19 deaths and recoveries data for India and found that most of the important values were missing,
e.g. about 89% of the age values are blank. The same goes for city, district and gender.


State, statecode, patientstatus and, to an extent, date are the only reasonably complete columns. Not much insight can be drawn from this data, it seems, except: <br>

--> number of deaths in each state <br>
--> deaths over time (though the date is missing for a few records) <br>
--> if we join this with the number of cases per state over time, we might get a deaths-per-case ratio for each state.
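The three ideas above could be sketched roughly as follows. All values are made up, and the `cases` table is hypothetical (case counts are not part of this endpoint and would have to come from another source).

```python
import pandas as pd

# Toy rows shaped like the API records (values are invented for illustration).
records = pd.DataFrame({
    "state": ["Maharashtra", "Maharashtra", "Gujarat", "Gujarat", "Kerala"],
    "patientstatus": ["Deceased", "Deceased", "Deceased", "Recovered", "Recovered"],
    "date": ["29/03/2020", "30/03/2020", "", "30/03/2020", "30/03/2020"],
})
# Hypothetical case counts per state.
cases = pd.DataFrame({
    "state": ["Maharashtra", "Gujarat", "Kerala"],
    "cases": [200, 50, 100],
})

deceased = records[records["patientstatus"] == "Deceased"]

# 1) Number of deaths in each state.
deaths_per_state = deceased.groupby("state").size().rename("deaths")

# 2) Deaths over time: dates are day/month/year strings, blanks become NaT
#    and are dropped from the timeline.
dates = pd.to_datetime(deceased["date"], format="%d/%m/%Y", errors="coerce")
deaths_per_day = dates.dropna().value_counts().sort_index()

# 3) Deaths per case: a left join keeps states that have no recorded deaths.
ratio = cases.merge(deaths_per_state.reset_index(), on="state", how="left")
ratio["deaths"] = ratio["deaths"].fillna(0)
ratio["deaths_per_case"] = ratio["deaths"] / ratio["cases"]

print(deaths_per_state)
print(deaths_per_day)
print(ratio)
```

A left join is used so that a state with cases but no recorded deaths shows a ratio of 0 instead of disappearing from the result.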


In [0]:
#Let's use the api instead
#first let's import stuff

import numpy as np
import pandas as pd
import requests
import json
from pandas import json_normalize
import matplotlib.pyplot as plt

In [0]:
url = 'https://api.covid19india.org/deaths_recoveries.json'

JSONContent = requests.get(url).json()
#content = json.dumps(JSONContent, indent = 4, sort_keys=True)  #, indent = 4
#help(json.dumps)
#print(content)

In [18]:
type(JSONContent)

dict

In [0]:
# Let's see if normalization works well
nested_full = json_normalize(JSONContent)

In [22]:
nested_full.shape
nested_full.head()

# The whole list of records ended up in a single cell instead of one row per record, so we'll have to use "record_path" ...

Unnamed: 0,deaths_recoveries
0,"[{'agebracket': '85', 'city': 'Mumbai', 'date'..."


In [0]:
nested_full = json_normalize(JSONContent,record_path='deaths_recoveries')

In [25]:
nested_full.head()  # Now that looks all right.

Unnamed: 0,agebracket,city,date,district,gender,nationality,notes,patientnumbercouldbemappedlater,patientstatus,slno,source1,source2,source3,state,statecode
0,85.0,Mumbai,29/03/2020,Mumbai,M,,"Suffering from Diabetes, had a pacemaker, no t...",,Deceased,1,https://arogya.maharashtra.gov.in/pdf/epressno...,https://www.deccanherald.com/national/west/dea...,,Maharashtra,MH
1,80.0,Mumbai,29/03/2020,Mumbai,M,,"patient passed away at the Fortis Hospital, Mu...",,Deceased,2,https://arogya.maharashtra.gov.in/pdf/epressno...,https://www.indiatoday.in/india/story/coronavi...,,Maharashtra,MH
2,86.0,Ghatkopar,29/03/2020,Mumbai Suburban,F,,,,Deceased,3,https://arogya.maharashtra.gov.in/pdf/epressno...,,,Maharashtra,MH
3,,,29/03/2020,Mumbai,,,,,Deceased,4,https://arogya.maharashtra.gov.in/pdf/epressno...,,,Maharashtra,MH
4,,,29/03/2020,Mumbai,,,,,Deceased,5,https://arogya.maharashtra.gov.in/pdf/epressno...,,,Maharashtra,MH
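The difference between the two `json_normalize` calls can be reproduced on a tiny stand-in payload. The dictionary below is an assumption that only mimics the shape of the API response (one top-level key holding a list of records).

```python
import pandas as pd

# Small stand-in for the API payload; the values are invented.
payload = {
    "deaths_recoveries": [
        {"slno": "1", "state": "Maharashtra", "patientstatus": "Deceased"},
        {"slno": "2", "state": "Kerala", "patientstatus": "Recovered"},
    ]
}

# Without record_path the list stays inside one cell: one row, one column.
flat = pd.json_normalize(payload)

# With record_path each element of the list becomes its own row.
by_record = pd.json_normalize(payload, record_path="deaths_recoveries")

print(flat.shape)
print(by_record.shape)
print(by_record.columns.tolist())
```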


In [26]:
print(nested_full.shape)
nested_full.describe()

(2416, 15)


Unnamed: 0,agebracket,city,date,district,gender,nationality,notes,patientnumbercouldbemappedlater,patientstatus,slno,source1,source2,source3,state,statecode
count,2416.0,2416.0,2416,2416.0,2416.0,2416.0,2416.0,2416.0,2416,2416,2416.0,2416.0,2416.0,2416,2416
unique,66.0,55.0,26,147.0,3.0,3.0,277.0,61.0,3,2416,349.0,90.0,11.0,32,32
top,,,16/04/2020,,,,,,Recovered,1780,,,,Maharashtra,MH
freq,2166.0,2289.0,285,1596.0,2158.0,2400.0,1678.0,2356.0,1955,1,409.0,2218.0,2401.0,494,493


In [27]:
nested_full.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2416 entries, 0 to 2415
Data columns (total 15 columns):
 #   Column                           Non-Null Count  Dtype 
---  ------                           --------------  ----- 
 0   agebracket                       2416 non-null   object
 1   city                             2416 non-null   object
 2   date                             2416 non-null   object
 3   district                         2416 non-null   object
 4   gender                           2416 non-null   object
 5   nationality                      2416 non-null   object
 6   notes                            2416 non-null   object
 7   patientnumbercouldbemappedlater  2416 non-null   object
 8   patientstatus                    2416 non-null   object
 9   slno                             2416 non-null   object
 10  source1                          2416 non-null   object
 11  source2                          2416 non-null   object
 12  source3                          2

In [28]:
# All the columns have dtype object, so we clearly need to fix this.

# One solution proposed on GeeksforGeeks, which sadly didn't work here:
# nested_full = nested_full.infer_objects()
# print(nested_full.dtypes)


# Another approach: look at the column names and cast the columns explicitly.
nested_full.columns

Index(['agebracket', 'city', 'date', 'district', 'gender', 'nationality',
       'notes', 'patientnumbercouldbemappedlater', 'patientstatus', 'slno',
       'source1', 'source2', 'source3', 'state', 'statecode'],
      dtype='object')
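A likely reason `infer_objects()` changed nothing is that it only re-types object columns that already hold numeric Python objects; it does not parse strings such as '85'. The two toy Series below are assumptions used to show the difference.

```python
import pandas as pd
from pandas.api.types import is_integer_dtype, is_numeric_dtype

# Object column holding real Python integers.
as_ints = pd.Series([85, 80, 45], dtype=object)
# Object column holding digits as text, which is how the API delivers them.
as_text = pd.Series(["85", "80", "45"], dtype=object)

ints_inferred = as_ints.infer_objects()
text_inferred = as_text.infer_objects()

print(ints_inferred.dtype)               # becomes an integer dtype
print(is_numeric_dtype(text_inferred))   # text is not turned into numbers
```

Since every value in this dataset arrives as a string, an explicit cast or parse is needed for each column.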

In [0]:
# Model line for a single column:
# nested_full.city = nested_full.city.astype('category')
# Similar columns can also be collected in a list and converted together, as shown below.

cat_cols = ['city', 'district', 'gender', 'nationality', 'patientstatus', 'state', 'statecode']
nested_full[cat_cols] = nested_full[cat_cols].astype('category')

In [30]:
nested_full.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2416 entries, 0 to 2415
Data columns (total 15 columns):
 #   Column                           Non-Null Count  Dtype   
---  ------                           --------------  -----   
 0   agebracket                       2416 non-null   object  
 1   city                             2416 non-null   category
 2   date                             2416 non-null   object  
 3   district                         2416 non-null   category
 4   gender                           2416 non-null   category
 5   nationality                      2416 non-null   category
 6   notes                            2416 non-null   object  
 7   patientnumbercouldbemappedlater  2416 non-null   object  
 8   patientstatus                    2416 non-null   category
 9   slno                             2416 non-null   object  
 10  source1                          2416 non-null   object  
 11  source2                          2416 non-null   object  
 12  source

In [14]:
pd.unique(nested_full.agebracket)

array(['85', '80', '86', '', '45', '74', '67', '70', '46', '47', '34',
       '65', '57', '40', '44', '52', '62', '69', '78', '56', '42', '41',
       '38', '68', '50', '49', '75', '25', '51', '72', '24', '73', '54',
       '32', '23', '55', '63', '36', '61', '29', '53', '60', '71', '58',
       '77', '64', '66', '27', '22', '37', '20', '30', '1', '48', '76',
       '59', '35', '26', '13', '14', '33', '8', '92', '18', '31', '95'],
      dtype=object)

In [46]:
nested_full.agebracket.value_counts()
nested_full.agebracket.value_counts(normalize=True)  # The result means 89% of the cases don't have age <facepalm>

      0.890742
65    0.007898
55    0.005704
52    0.004827
60    0.003949
        ...   
92    0.000439
74    0.000439
8     0.000439
31    0.000439
30    0.000439
Name: agebracket, Length: 66, dtype: float64

In [58]:
# Just to learn
nested_full.agebracket.value_counts()['65']  # gives the number of occurrences where agebracket is '65'.

18

In [0]:
# Two more columns need their data types changed: age and date.

nested_full.agebracket = nested_full.agebracket.astype('int32')

# Just to learn
# The error occurs for the rows where the value is an empty string.
# One possible option is to convert all the blanks to NaN and then use the nullable 'Int64' dtype instead of 'int64' (source: https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html#integer-na)

In [60]:
# The cast to int32 throws an error because the empty strings can't be converted to int.
# Let's see how to handle them.
# nested_full.agebracket.replace(to_replace='', value=np.nan)  # wait, I found a better method
print(nested_full.agebracket.replace(r'^\s*$', np.nan, regex=True))

0        85
1        80
2        86
3       NaN
4       NaN
       ... 
2274    NaN
2275    NaN
2276    NaN
2277    NaN
2278    NaN
Name: agebracket, Length: 2279, dtype: object


In [64]:
nested_full.agebracket = nested_full.agebracket.replace(r'^\s*$', np.nan, regex=True)
nested_full.agebracket.value_counts(normalize=True)


65    0.072289
55    0.052209
52    0.044177
60    0.036145
45    0.036145
        ...   
30    0.004016
77    0.004016
18    0.004016
31    0.004016
95    0.004016
Name: agebracket, Length: 65, dtype: float64
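Once the blanks are NaN, the two remaining conversions (age and date) might look like the sketch below. The toy frame is an assumption that mimics the state of the columns at this point; the nullable 'Int64' dtype is the one described in the pandas guide linked earlier, and the date format string assumes day/month/year as seen in the head() output.

```python
import numpy as np
import pandas as pd

# Toy frame: age already has NaN for blanks, date still has an empty string.
df = pd.DataFrame({
    "agebracket": pd.Series(["85", np.nan, "46"], dtype=object),
    "date": ["29/03/2020", "", "16/04/2020"],
})

# Parse the digit strings, then switch to the nullable integer dtype so the
# missing ages stay missing instead of forcing the column to float.
df["agebracket"] = pd.to_numeric(df["agebracket"]).astype("Int64")

# Parse day/month/year strings; anything unparseable (including '') becomes NaT.
df["date"] = pd.to_datetime(df["date"], format="%d/%m/%Y", errors="coerce")

print(df.dtypes)
print(df)
```

Note that `errors="coerce"` hides malformed dates as well as blank ones, so it is worth checking how many NaT values appear after the conversion.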

In [32]:
a = (nested_full.agebracket.isna()).sum()  # number of nulls for age
b = (nested_full.agebracket.notna()).sum()  # number of available values for age
b/(a+b)  # only 10.9% of the rows have a non-null age

(nested_full.isna()).sum()  # but most of the missing values are blank strings rather than nulls, so let's check.
nested_full.date.value_counts()[''] 
nested_full.columns

Index(['agebracket', 'city', 'date', 'district', 'gender', 'nationality',
       'notes', 'patientnumbercouldbemappedlater', 'patientstatus', 'slno',
       'source1', 'source2', 'source3', 'state', 'statecode'],
      dtype='object')

# Important Note
The output of the next few cells gives a decent picture of the data.

In [70]:
# nested_full = nested_full.replace(r'^\s*$', np.nan, regex=True)  # would turn the blanks in every column into NaN
nested_deceased = nested_full[nested_full.patientstatus == 'Deceased']

# Print the value counts of every column of the full data.
for col in nested_full.columns:
    print("For " + str(col))
    print(nested_full[col].value_counts())



For agebracket
      2166
65      18
55      13
52      11
60       9
      ... 
92       1
30       1
77       1
8        1
14       1
Name: agebracket, Length: 66, dtype: int64
For city
                    2289
Mumbai                20
Indore                14
Pune                  10
Pimpri-Chinchwad       9
Hyderabad              7
Ahmadabad              5
Chennai                4
Dharavi                2
Kalyan-Dombivli        2
Ludhiana               2
Bhubaneswar            2
Bhavnagar              2
Palghar                2
Bengaluru              2
Thagarapuvalasa        2
Vijayawada             2
Surat                  2
Ujjain                 2
Belghoria              1
Bikaner                1
Bagalkote              1
Chandigarh             1
Bhatkal                1
Baddi                  1
Chhindwara             1
Aurangabad             1
Dachepalli             1
Armpora Sopore         1
Agra                   1
Girgaon                1
Dum Dum                1
Ghatkopar   

In [53]:
# nested_full = nested_full.replace(r'^\s*$', np.nan, regex=True)
nested_deceased = nested_full[nested_full.patientstatus == 'Deceased']

# Same loop, restricted to the deceased patients.
for col in nested_full.columns:
    print("For " + str(col))
    print(nested_deceased[col].value_counts())
    # Note: indexing with [''] raises a KeyError for a column that has no blank strings.
    # print(nested_deceased[col].value_counts()[''])


For agebracket


KeyError: ignored
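One plausible cause of this KeyError is a lookup of `''` in the value counts of a column that no longer contains blank strings (for example agebracket after its blanks were replaced with NaN). A minimal sketch of a lookup that tolerates the missing label, on an invented Series:

```python
import pandas as pd

# Toy column without any blank strings.
status = pd.Series(["Deceased", "Deceased", "Recovered"])
counts = status.value_counts()

# Direct indexing fails when the label is absent.
try:
    counts[""]
    raised = False
except KeyError:
    raised = True

# Series.get returns a default instead of raising.
blank_count = counts.get("", 0)
deceased_count = counts.get("Deceased", 0)

print(raised, blank_count, deceased_count)
```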

In [68]:
nested_full.shape[0]

2416