<a href="https://colab.research.google.com/github/gauravshetty98/Gaurav-GIS-Repo/blob/main/data_extraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Data Extraction and Cleaning

This notebook walks through downloading the zip file from the GitHub repository, extracting the data from its contents, and combining it into a single dataset.

Dataset: https://www.samhsa.gov/data/report/2021-nsduh-state-specific-tables

The dataset contains one PDF per state with drug abuse and mental health information for that state.

We start off by installing the `tabula-py` library (imported as `tabula`), which is used to extract the tables from the PDFs.


In [None]:
!pip install -q tabula-py

Once the library is installed, we import all the required libraries. We use Google Drive to store the PDFs in the dataset.

**This notebook requires mounting your Google Drive to store the dataset files. You can delete those files after running the notebook.**



In [None]:
import os
from os import path
import tabula
import pandas as pd
import numpy as np
from geopy.geocoders import Nominatim


from google.colab import drive

drive.mount("/content/gdrive")

### Creating folders for storing all the data

`2021NSDUHsaeSpecificStatesTabs122022` is the name of the new folder, which will contain all the state-wise PDFs.
A `tables` subfolder will hold the tables extracted from each state PDF.



In [None]:
dir = "/content/gdrive/My Drive/2021NSDUHsaeSpecificStatesTabs122022"
folder_name = "tables"

Creating a new folder if it is not present

In [None]:
# Create the dataset folder and its "tables" subfolder if they do not exist yet.
if not os.path.exists(dir):
    os.mkdir(dir)
    print("done")
if not os.path.exists(os.path.join(dir, folder_name)):
    os.mkdir(os.path.join(dir, folder_name))
    print("done")

!ls "//content//gdrive//My Drive//2021NSDUHsaeSpecificStatesTabs122022"
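
As a side note, the "create only if missing" pattern can also be written with `os.makedirs(..., exist_ok=True)`. The sketch below is only an illustration: it works in a temporary directory, and the folder names `dataset` and `tables_dir` are made up for the example rather than taken from the notebook.

```python
import os
import tempfile

# Work inside a throwaway directory so nothing on Drive is touched.
base = tempfile.mkdtemp()
tables_dir = os.path.join(base, "dataset", "tables")

# Creates "dataset" and "dataset/tables" in one call.
os.makedirs(tables_dir, exist_ok=True)
# Calling it again is harmless because exist_ok=True suppresses the error.
os.makedirs(tables_dir, exist_ok=True)

print(os.path.isdir(tables_dir))
```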

Download the zip file from GitHub and extract it into the Drive folder we created.

In [None]:
!wget "https://github.com/gauravshetty98/Gaurav-GIS-Repo/raw/main/2021NSDUHsaeSpecificStatesTabs122022.zip"
!unzip /content/2021NSDUHsaeSpecificStatesTabs122022.zip -d "//content//gdrive//My Drive//2021NSDUHsaeSpecificStatesTabs122022"

### Concatenating all files and creating a single DataFrame

We iterate through all PDF files in the new directory and extract the tables on the first two pages using `tabula.read_pdf()`.

We then concatenate the two extracted tables into a single DataFrame and add a new column, `states`, holding the state name taken from the file name.

Finally, we write the DataFrame to a CSV file in the `tables` folder on Google Drive.

In [None]:
ext = ('.pdf', '.PDF')
for files in os.listdir(dir):
    if files.endswith(ext):
        print(files)
        # Extract the tables on the first two pages of the state PDF.
        tables = tabula.read_pdf(os.path.join(dir, files), pages=[1, 2])
        # Strip the extension, the "NSDUHsae" tag and the year to get the state name.
        filename = os.path.splitext(files)[0].replace("NSDUHsae", "").replace("2021", "")
        # read_pdf() already returns DataFrames, so the first two can be stacked directly.
        final_table = pd.concat([tables[0], tables[1]])
        final_table['states'] = filename
        final_table.to_csv(os.path.join(dir, folder_name, filename + ".csv"), index=False)
        print(filename)
    else:
        print("Else: ", files)

The output lists all the CSVs extracted and stored state-wise in the `tables` folder.

In [None]:
!ls "//content//gdrive//My Drive//2021NSDUHsaeSpecificStatesTabs122022/tables"

All the CSVs are then concatenated into a single DataFrame.

In [None]:
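csv_path = os.path.join(dir, folder_name)
# Read only the per-state CSVs; skip the combined file saved later so reruns do not duplicate rows.
frames = [
  pd.read_csv(os.path.join(csv_path, name))
  for name in os.listdir(csv_path)
  if name.endswith(".csv") and name != "state_dataset.csv"
]
final_dataset = pd.concat(frames, ignore_index=True)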
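
To make the concatenation step concrete, here is a minimal, self-contained sketch. It assumes two tiny made-up CSV files written to a temporary folder; the column names and values are placeholders, not the real NSDUH data.

```python
import os
import tempfile

import pandas as pd

# Write two tiny CSVs to a temporary folder (hypothetical data).
tmp_dir = tempfile.mkdtemp()
pd.DataFrame({"Measure": ["A"], "states": ["X"]}).to_csv(
    os.path.join(tmp_dir, "X.csv"), index=False)
pd.DataFrame({"Measure": ["B"], "states": ["Y"]}).to_csv(
    os.path.join(tmp_dir, "Y.csv"), index=False)

# Read every CSV in the folder, then concatenate once at the end.
frames = [
    pd.read_csv(os.path.join(tmp_dir, name))
    for name in sorted(os.listdir(tmp_dir))
    if name.endswith(".csv")
]
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

Collecting the frames in a list and calling `pd.concat` once avoids copying the growing DataFrame on every iteration.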

---------------------
## Data Cleaning

We now start cleaning the data. We first preview the first ten rows of the combined dataset to see what needs fixing.

In [None]:
print(final_dataset.head(10))

### Dropping Empty Rows

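The dataset contains some rows that are empty and don't contain any useful values. Where such a row holds only a label that wrapped onto its own line, we prepend that label to the label of the following row; we then count the empty rows so they can be dropped.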

Count of empty rows in the dataset = 392

In [None]:
count = 0
for i in range(final_dataset.shape[0]):
  # A row counts as empty when columns 1 to 5 are all NaN.
  if final_dataset.iloc[i, 1:6].isna().all():
    # Prepend the wrapped label to the next row's label, if there is a next row.
    if i + 1 < final_dataset.shape[0]:
      final_dataset.iloc[i+1, 0] = str(final_dataset.iloc[i, 0]) + " " + str(final_dataset.iloc[i+1, 0])
    count = count + 1

print("Number of empty rows: ", count)
print("Actual data we need: ", final_dataset.shape[0] - count)
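
The same idea can be sketched without an explicit loop. The example below is an illustration on a made-up three-row table (the labels and column names are hypothetical), and unlike the loop it does not chain labels across several consecutive empty rows.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the extracted table: row 0 is a wrapped label with no values.
df = pd.DataFrame({
    "Measure": ["Past Year", "Marijuana Use", "Alcohol Use"],
    "12+": [np.nan, 10.0, 50.0],
    "18+": [np.nan, 11.0, 55.0],
})

# A row is empty when every value column is NaN.
is_empty = df[["12+", "18+"]].isna().all(axis=1)

# Prepend each empty row's label to the label of the row that follows it.
prefix = df["Measure"].where(is_empty).shift(1)
df["Measure"] = (prefix + " " + df["Measure"]).fillna(df["Measure"])

# Drop the empty rows now that their labels have been carried forward.
cleaned = df[~is_empty].reset_index(drop=True)
print(cleaned)
```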

We use the `dropna()` function to drop all the empty rows. The resulting DataFrame contains no empty rows.

Final shape of the dataset = 2016 × 7

In [None]:
final_dataset = final_dataset.dropna()
final_dataset = final_dataset.reset_index()
print(final_dataset.head(10))

### Replacing more Null values

The dataset still contains some null values. During extraction, the null values in the PDF were converted to the string `"--"`. There are also some `*` entries, which mark estimates with low precision or no estimate at all. We search for these strings and replace them with zero, as they are of no use in our dataset.

In [None]:
print("Before: ", final_dataset.iloc[260,:])
final_dataset = final_dataset.replace("--",0)
final_dataset = final_dataset.replace("*",0)
print("After: ", final_dataset.iloc[260,:])
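
As a small illustration of this replacement, the sketch below uses a made-up column (the name `12+` and the values are placeholders). It replaces both placeholder strings in one call and then converts the column to numbers, a step the notebook itself does not perform.

```python
import pandas as pd

# Hypothetical values as they might look after extraction; object dtype keeps
# the mixed text/number handling explicit.
df = pd.DataFrame(
    {"Measure": ["A", "B", "C"], "12+": ["1.5", "--", "*"]},
    dtype=object,
)

# Replace both placeholder strings with zero in a single call.
df = df.replace(["--", "*"], 0)

# Convert the remaining numeric strings so the column can be used in calculations.
df["12+"] = pd.to_numeric(df["12+"])
print(df)
```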

--------
### Storing the dataset into a CSV file

In [None]:
final_dataset.to_csv(os.path.join(dir + "//" + folder_name, "state_dataset.csv"), index=False)

------
The cleaning of the NSDUH dataset is done. This dataset will be used further in the plotting problem sets.

This notebook will be updated with new sections whenever a new dataset is imported that requires cleaning.

# Adding total count to rehab centre dataset

In [None]:
! wget -q -O rehabData.csv https://github.com/gauravshetty98/Gaurav-GIS-Repo/raw/main/NSSATS_PUF_2020_CSV.csv

In [None]:
rehab_df = pd.read_csv('rehabData.csv')
rehab_df.head()

In [None]:
# Count how many consecutive rows share the same value in column 1 and
# repeat that count for every row of the run.
count = 1
count_list = []
for i in range(rehab_df.shape[0]-1):
  if rehab_df.iloc[i,1] == rehab_df.iloc[i+1, 1]:
    count = count + 1
  else:
    count_list.extend([count] * count)
    count = 1
# Flush the final run.
count_list.extend([count] * count)
rehab_df['Total Count'] = count_list

print(len(count_list))
print(rehab_df.shape)
rehab_df['STATE'].unique()
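
The run-counting loop can also be sketched with pandas grouping. The example below assumes a tiny made-up table whose `STATE` column is already ordered so that equal values sit next to each other; the values are placeholders.

```python
import pandas as pd

# Hypothetical facility table that is already grouped by state.
df = pd.DataFrame({"STATE": ["AK", "AK", "AL", "AL", "AL", "AZ"]})

# Give each run of identical consecutive values its own id ...
run_id = (df["STATE"] != df["STATE"].shift()).cumsum()
# ... then attach the size of its run to every row.
df["Total Count"] = df.groupby(run_id)["STATE"].transform("size")
print(df)
```

Like the loop, this counts runs of consecutive equal values, so a state that appears in two separate blocks would get two separate counts.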

In [None]:
rehab_df.to_csv('rehab_dataset.csv')

In [None]:
! wget -q -O mental_rehab.csv https://github.com/gauravshetty98/Gaurav-GIS-Repo/raw/main/nmhss-puf-2020-csv.csv

In [None]:
mental_df = pd.read_csv('mental_rehab.csv')
mental_df.head()

In [None]:
# Count how many consecutive rows share the same value in column 1 and
# repeat that count for every row of the run.
count = 1
count_list = []
for i in range(mental_df.shape[0]-1):
  if mental_df.iloc[i,1] == mental_df.iloc[i+1, 1]:
    count = count + 1
  else:
    count_list.extend([count] * count)
    count = 1
# Flush the final run.
count_list.extend([count] * count)
mental_df['Total Count'] = count_list

print(len(count_list))
print(mental_df.shape)
print(mental_df['LST'].unique())
mental_df['Total Count'].unique()

In [None]:
mental_df.to_csv('mental_health_treatment.csv')

In [None]:
!wget -q -O rehab_dir.xlsx https://github.com/gauravshetty98/Gaurav-GIS-Repo/raw/main/national-directory-su-facilities-2023.xlsx

In [None]:
rehab_dir = pd.read_excel('rehab_dir.xlsx', sheet_name = 0)
print(rehab_dir.head())
print(rehab_dir.shape)

In [None]:
loc = Nominatim(user_agent="Geopy Library")

getLoc = loc.geocode("İzmir")

print("Latitude = ", getLoc.latitude, "\n")
print("Longitude = ", getLoc.longitude)

In [None]:
lat_list = []
long_list = []

for i in range(rehab_dir.shape[0]):
  # Column 6 is treated as the ZIP code; restore the leading zero dropped from 4-digit values.
  zip_code = str(rehab_dir.iloc[i,6])
  if len(zip_code) == 4:
    zip_code = "0" + zip_code
  try:
    getLoc = loc.geocode(zip_code + " , united states")
    lat_list.append(getLoc.latitude)
    long_list.append(getLoc.longitude)
  except Exception:
    # Covers lookups that return None as well as network or service errors.
    lat_list.append(0)
    long_list.append(0)
    print("not found: ", i)

print(len(lat_list))
print(len(long_list))
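
The leading-zero fix can be generalised with `str.zfill`. The helper below, `normalize_zip`, is a hypothetical name introduced only for this sketch; it assumes ZIP codes arrive either as integers that lost their leading zeros or as strings, possibly in ZIP+4 form.

```python
def normalize_zip(value):
    """Return a 5-character ZIP string, restoring dropped leading zeros."""
    text = str(value).strip()
    # Keep only the part before the dash for ZIP+4 codes such as "12345-6789".
    text = text.split("-")[0]
    # Left-pad with zeros up to five characters.
    return text.zfill(5)


print(normalize_zip(7030))
print(normalize_zip("30301"))
```

Unlike the 4-character check, this also pads 3-digit values, which the loop above would pass to the geocoder unchanged.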

In [None]:
rehab_dir['lat'] = lat_list
rehab_dir['long'] = long_list
rehab_dir.to_csv('rehab_dir_24oct.csv')