<a href="https://colab.research.google.com/github/gauravshetty98/Gaurav-GIS-Repo/blob/main/data_extraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Data Extraction

This notebook walks through downloading a zip file from a GitHub repository, extracting the tables from the PDFs it contains, and combining them into a single dataset.

Dataset: https://www.samhsa.gov/data/report/2021-nsduh-state-specific-tables

The dataset contains one PDF per state, each with drug abuse and mental health information for that state.

We start off by installing the `tabula-py` library, which is imported as `tabula` and is used to extract the tables from the PDFs.


In [29]:
!pip install -q tabula-py

Once the library is installed, we import all the required libraries. We use Google Drive to store the PDFs from the dataset.

**This notebook requires mounting your Google Drive to store the dataset files. You can delete those files after running the notebook.**



In [30]:
import os
from os import path
import tabula
import pandas as pd
import numpy as np

from google.colab import drive

drive.mount("/content//gdrive")

Drive already mounted at /content//gdrive; to attempt to forcibly remount, call drive.mount("/content//gdrive", force_remount=True).



`2021NSDUHsaeSpecificStatesTabs122022` is the name of the new folder, which will contain all the state-wise PDFs.
A `tables` folder inside it will contain the tables extracted from each state PDF.



In [31]:
dir = "//content//gdrive//My Drive//2021NSDUHsaeSpecificStatesTabs122022"
folder_name = "tables"

Create both folders if they are not already present.

In [None]:
# Create the dataset folder first, then its "tables" subfolder, if missing
for folder in (dir, os.path.join(dir, folder_name)):
    if not os.path.exists(folder):
        os.mkdir(folder)
        print("done")

!ls "//content//gdrive//My Drive//2021NSDUHsaeSpecificStatesTabs122022"
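
As an alternative sketch, the same folder setup can be done with a single `pathlib` call. The helper name `ensure_folders` and the temporary base directory are assumptions made for illustration; they are not part of the notebook.

```python
import tempfile
from pathlib import Path


def ensure_folders(base_dir, subfolder="tables"):
    """Create base_dir and base_dir/subfolder if missing; return the subfolder path."""
    target = Path(base_dir) / subfolder
    # parents=True also creates base_dir; exist_ok=True makes re-runs harmless
    target.mkdir(parents=True, exist_ok=True)
    return target


# Demonstration in a throwaway directory (hypothetical stand-in for the Drive folder)
demo_base = Path(tempfile.mkdtemp()) / "2021NSDUHsaeSpecificStatesTabs122022"
tables_dir = ensure_folders(demo_base)
print(tables_dir.is_dir())
```

Because `exist_ok=True` is set, calling the helper again on an existing folder does nothing, which suits a notebook whose cells are often re-run.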

Download the zip file from GitHub and extract it into the Drive folder we created.

In [34]:
!wget "https://github.com/gauravshetty98/Gaurav-GIS-Repo/raw/main/2021NSDUHsaeSpecificStatesTabs122022.zip"
!unzip /content/2021NSDUHsaeSpecificStatesTabs122022.zip -d "//content//gdrive//My Drive//2021NSDUHsaeSpecificStatesTabs122022"

--2023-09-12 15:01:41--  https://github.com/gauravshetty98/Gaurav-GIS-Repo/raw/main/2021NSDUHsaeSpecificStatesTabs122022.zip
Resolving github.com (github.com)... 140.82.121.3
Connecting to github.com (github.com)|140.82.121.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/gauravshetty98/Gaurav-GIS-Repo/main/2021NSDUHsaeSpecificStatesTabs122022.zip [following]
--2023-09-12 15:01:42--  https://raw.githubusercontent.com/gauravshetty98/Gaurav-GIS-Repo/main/2021NSDUHsaeSpecificStatesTabs122022.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13301032 (13M) [application/zip]
Saving to: ‘2021NSDUHsaeSpecificStatesTabs122022.zip.1’


2023-09-12 15:01:42 (267 MB/s) - ‘2021NSDUHsaeSpecificStatesTabs1
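
The extraction step can also be done from Python with the standard-library `zipfile` module instead of the `unzip` shell command. The following is a minimal sketch under stated assumptions: the helper `extract_zip` is a hypothetical name, and the demonstration builds a tiny placeholder archive in a temporary directory rather than using the real download.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_zip(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()


# Build a tiny archive on the fly (stand-in for the real zip file)
work = Path(tempfile.mkdtemp())
demo_zip = work / "demo.zip"
with zipfile.ZipFile(demo_zip, "w") as archive:
    archive.writestr("NSDUHsaeAlabama2021.pdf", b"placeholder bytes, not a real PDF")

members = extract_zip(demo_zip, work / "extracted")
print(members)
```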

We iterate through all PDF files in the new directory and extract the tables on the first two pages of each one using `tabula.read_pdf()`.

We then concatenate the two extracted tables into a single dataframe and add a new column, `states`, holding the state name taken from the file name.

Finally, we write the dataframe to a CSV file in the `tables` folder in Google Drive.

In [35]:
ext = ('.pdf', '.PDF')
for file in os.listdir(dir):
    if file.endswith(ext):
        print(file)
        # Extract the tables found on the first two pages of the PDF
        tables = tabula.read_pdf(os.path.join(dir, file), pages=[1, 2])
        # "NSDUHsaeAlabama2021.pdf" -> "Alabama"; splitext also handles ".PDF"
        filename = os.path.splitext(file)[0].replace("NSDUHsae", "").replace("2021", "")
        # Stack the two tables and tag every row with the state name
        final_table = pd.concat([tables[0], tables[1]])
        final_table['states'] = filename
        final_table.to_csv(os.path.join(dir, folder_name, filename + ".csv"), index=False)
        print(filename)
    else:
        # Anything that is not a PDF (for example the "tables" folder) is skipped
        print("Else: ", file)
#final_table.to_csv(os.path.join(folder_name, "FinalTable.csv"), index=False)

Else:  tables
Else:  images
NSDUHsaeAlabama2021.pdf
Alabama
NSDUHsaeAlaska2021.pdf
Alaska
NSDUHsaeArizona2021.pdf
Arizona
NSDUHsaeArkansas2021.pdf
Arkansas
NSDUHsaeCalifornia2021.pdf
California
NSDUHsaeColorado2021.pdf
Colorado
NSDUHsaeConnecticut2021.pdf
Connecticut
NSDUHsaeDelaware2021.pdf
Delaware
NSDUHsaeDistrictOfCol2021.pdf
DistrictOfCol
NSDUHsaeFlorida2021.pdf
Florida
NSDUHsaeGeorgia2021.pdf
Georgia
NSDUHsaeHawaii2021.pdf
Hawaii
NSDUHsaeIdaho2021.pdf
Idaho
NSDUHsaeIllinois2021.pdf
Illinois
NSDUHsaeIndiana2021.pdf
Indiana
NSDUHsaeIowa2021.pdf
Iowa
NSDUHsaeKansas2021.pdf
Kansas
NSDUHsaeKentucky2021.pdf
Kentucky
NSDUHsaeLouisiana2021.pdf
Louisiana
NSDUHsaeMaine2021.pdf
Maine
NSDUHsaeMaryland2021.pdf
Maryland
NSDUHsaeMassachusetts2021.pdf
Massachusetts
NSDUHsaeMichigan2021.pdf
Michigan
NSDUHsaeMidwest2021.pdf
Midwest
NSDUHsaeMinnesota2021.pdf
Minnesota
NSDUHsaeMississippi2021.pdf
Mississippi
NSDUHsaeMissouri2021.pdf
Missouri
NSDUHsaeMontana2021.pdf
Montana
NSDUHsaeNational2021.pdf
N
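
The value written to the `states` column comes from the file name, as the paired lines in the output above show. A minimal, stand-alone sketch of that step follows; the helper name `state_from_filename` and its `prefix` and `year` parameters are assumptions for illustration. Unlike a chain of `replace` calls, it strips the prefix and the year only at the start and end of the name.

```python
import os


def state_from_filename(file_name, prefix="NSDUHsae", year="2021"):
    """Return the label embedded in a name such as 'NSDUHsaeAlabama2021.pdf'."""
    stem, _ = os.path.splitext(file_name)  # drops '.pdf' or '.PDF'
    if stem.startswith(prefix):
        stem = stem[len(prefix):]
    if stem.endswith(year):
        stem = stem[:-len(year)]
    return stem


for name in ["NSDUHsaeAlabama2021.pdf", "NSDUHsaeDistrictOfCol2021.PDF"]:
    print(name, "->", state_from_filename(name))
```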

The listing below shows the CSV files, one per state, stored in the `tables` folder.

In [36]:
!ls "//content//gdrive//My Drive//2021NSDUHsaeSpecificStatesTabs122022/tables"

Alabama.csv	   Indiana.csv	      National.csv	 RhodeIsland.csv
Alaska.csv	   Iowa.csv	      Nebraska.csv	 SouthCarolina.csv
Arizona.csv	   Kansas.csv	      Nevada.csv	 South.csv
Arkansas.csv	   Kentucky.csv       NewHampshire.csv	 SouthDakota.csv
California.csv	   Louisiana.csv      NewJersey.csv	 Tennessee.csv
Colorado.csv	   Maine.csv	      NewMexico.csv	 Texas.csv
Connecticut.csv    Maryland.csv       NewYork.csv	 Utah.csv
Delaware.csv	   Massachusetts.csv  NorthCarolina.csv  Vermont.csv
DistrictOfCol.csv  Michigan.csv       NorthDakota.csv	 Virginia.csv
Florida.csv	   Midwest.csv	      Northeast.csv	 Washington.csv
Georgia.csv	   Minnesota.csv      Ohio.csv		 West.csv
Hawaii.csv	   Mississippi.csv    Oklahoma.csv	 WestVirginia.csv
Idaho.csv	   Missouri.csv       Oregon.csv	 Wisconsin.csv
Illinois.csv	   Montana.csv	      Pennsylvania.csv	 Wyoming.csv


All the CSVs are then concatenated into a single dataframe, which is saved as `final.csv` in the same folder.

In [37]:
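csv_path = os.path.join(dir, folder_name)
frames = []
for table_file in os.listdir(csv_path):
    # Skip the combined output so that re-running this cell does not duplicate rows
    if table_file == "final.csv":
        continue
    frames.append(pd.read_csv(os.path.join(csv_path, table_file)))
# Concatenate once instead of growing the dataframe inside the loop
final_dataset = pd.concat(frames)
final_dataset.to_csv(os.path.join(csv_path, "final.csv"), index=False)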

In [38]:
print(final_dataset.head())

                                             Measure    12+ 12-17 18-25  26+  \
0                                      ILLICIT DRUGS    NaN   NaN   NaN  NaN   
1              Illicit Drug Use in the Past Month1,2    403    17    98  288   
2                     Marijuana Use in the Past Year    537    30   152  355   
3                    Marijuana Use in the Past Month    315    16    90  209   
4  Perceptions of Great Risk from Smoking Marijua...  1,128   105    83  940   

     18+   states  
0    NaN  Alabama  
1    386  Alabama  
2    507  Alabama  
3    299  Alabama  
4  1,023  Alabama  
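
The preview shows counts such as `1,128` stored as text with thousands separators, and section-header rows such as `ILLICIT DRUGS` with no values. If numeric columns are needed later, one possible clean-up is sketched below. The toy dataframe and the helper name `to_numeric_counts` are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd

# Toy rows shaped like the preview above (values copied from it)
sample = pd.DataFrame({
    "Measure": ["ILLICIT DRUGS", "Marijuana Use in the Past Year",
                "Perceptions of Great Risk from Smoking Marijua..."],
    "12+": [None, "537", "1,128"],
    "18+": [None, "507", "1,023"],
    "states": ["Alabama", "Alabama", "Alabama"],
})


def to_numeric_counts(df, columns):
    """Return a copy of df with the given columns parsed as numbers."""
    out = df.copy()
    for col in columns:
        # Remove thousands separators before parsing
        cleaned = out[col].astype(str).str.replace(",", "", regex=False)
        # errors="coerce" turns anything unparseable (e.g. "None") into NaN
        out[col] = pd.to_numeric(cleaned, errors="coerce")
    return out


cleaned = to_numeric_counts(sample, ["12+", "18+"])
print(cleaned.dtypes)
```

Header rows keep `NaN` in the converted columns, so they can be dropped or kept as section markers depending on the analysis.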
