# 01. Download Setup

The U.S. National Library of Medicine has a collection of pill images as part of a project that was discontinued in 2018.  This collection consisted of over 137,000 pill images, split between reference pill images and consumer-grade pill images.

The 4,000+ reference pill images were taken in a controlled setting, across multiple resolutions.  This collection of images is downloadable as a 6.8GB zip file ([link](https://www.nlm.nih.gov/databases/download/pill_image.html)).  We will use the reference pill images as our training data.

The consumer-grade pill images will be used as our testing data, since we ultimately want to identify pills in images provided by consumers in real-world settings.  Acquiring this testing data, however, is a different story.  The consumer-grade pill images are stored on an FTP site ([link](ftp://lhcftp.nlm.nih.gov/Open-Access-Datasets/Pills/)) and can only be downloaded one by one.  With over 133,000 images in total, downloading them manually would be next to impossible given our time constraints.

In this case, we will leverage a directory of all consumer-grade images (originally downloaded as an `.xlsx` file and converted to a `.csv` file) as well as the `webbrowser` library in order to download each file locally.

This notebook adjusts that `.csv` file and saves the result so that later notebooks can read it and download the respective images.
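
As a rough illustration of the `webbrowser` approach mentioned above, the sketch below hands a small batch of links to the browser with a pause between them.  The helper name `open_urls`, its `limit` and `pause_seconds` parameters, the injectable `opener` argument, and the sample file names are all assumptions made for illustration; the actual download logic lives in a later notebook and may look different.  Whether a given browser still handles `ftp://` links depends on the browser and its version.

```python
import time
import webbrowser


def open_urls(urls, limit=5, pause_seconds=1.0, opener=webbrowser.open):
    """Hand up to `limit` URLs to `opener`, pausing between calls.

    Hypothetical helper: `opener` defaults to webbrowser.open but can be
    swapped for any callable, which makes dry runs possible.
    """
    opened = []
    for link in list(urls)[:limit]:
        opener(link)               # the default opener asks the system browser to load the link
        opened.append(link)
        time.sleep(pause_seconds)  # avoid flooding the browser or the server
    return opened


# Dry run: collect the links in a list instead of opening a browser
base = 'ftp://lhcftp.nlm.nih.gov/Open-Access-Datasets/Pills/'
sample_links = [base + 'example_1.JPG', base + 'example_2.JPG']  # made-up file names
seen = []
open_urls(sample_links, pause_seconds=0, opener=seen.append)
print(seen == sample_links)  # True
```

Passing a different `opener` keeps the loop testable without launching a browser for every link.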

---
## Table of Contents

- [01. Importing Libraries](#01.-Importing-Libraries)
- [02. Quick EDA](#02.-Quick-EDA)
- [03. Amending Directory](#03.-Amending-Directory)
- [04. Saving Down Changes](#04.-Saving-Down-Changes)

---
### 01. Importing Libraries

We only need a single library for what this notebook will accomplish: `pandas`.  `pandas` will be used to read in and amend the csv file needed for future notebooks.

In [1]:
import pandas as pd

---
### 02. Quick EDA

In this section, we will read in the csv file, look at what the data contains, and check for missing values.

In [2]:
# Read-in csv file
consumer = pd.read_csv('../data/directory_consumer_grade_images.csv')

# Take a look at the top 5 rows
consumer.head()

Unnamed: 0,NDC11,Part,Image,Layout,Name
0,2322730,1,PillProjectDisc69/images/CLLLLUPGIX7J8MP1WWQ9W...,C3PI_Reference,STRATTERA 10MG
1,2322730,1,PillProjectDisc98/images/PRNJ-AXZIQ!HUQKJJBP_D...,C3PI_Reference,STRATTERA 10MG
2,2322730,1,PillProjectDisc10/images/79U-YY6M1UUR6F127ZMAC...,C3PI_Test,STRATTERA 10MG
3,2322730,1,PillProjectDisc11/images/7WVFV5H74!ELFNQ_GUH92...,C3PI_Test,STRATTERA 10MG
4,2322730,1,PillProjectDisc20/images/B4CH0R9B7PEQ6GORRX-8X...,C3PI_Test,STRATTERA 10MG


In [3]:
# Look at the total amount of rows (total image count)
consumer.shape[0]

133774

In [4]:
# Counting the distinct medication names to see how many medications there are in total
consumer['Name'].nunique()

3010

This directory lists a total of 133,774 images across 3,010 different medications that we can potentially download.  For the sake of convenience, we will only pull JPEG files, as they are easier to work with.  We will filter for file names ending in `.JPG` (the comparison is case-sensitive) and save the matching rows as a new dataframe.

In [5]:
# Create a new dataframe with only the images whose file name ends in '.JPG'
consumer_lookup = consumer[consumer['Image'].str.endswith('.JPG')]
consumer_lookup.head()

Unnamed: 0,NDC11,Part,Image,Layout,Name
2,2322730,1,PillProjectDisc10/images/79U-YY6M1UUR6F127ZMAC...,C3PI_Test,STRATTERA 10MG
3,2322730,1,PillProjectDisc11/images/7WVFV5H74!ELFNQ_GUH92...,C3PI_Test,STRATTERA 10MG
4,2322730,1,PillProjectDisc20/images/B4CH0R9B7PEQ6GORRX-8X...,C3PI_Test,STRATTERA 10MG
5,2322730,1,PillProjectDisc21/images/B5T5HI5XI8X2HSBJL-TGD...,C3PI_Test,STRATTERA 10MG
6,2322730,1,PillProjectDisc25/images/B8S621VZTSXR4Z4VRHYLT...,C3PI_Test,STRATTERA 10MG
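
Because the filter above is case-sensitive, it only keeps names ending in uppercase `.JPG`.  The sketch below, run on a small made-up directory (the file names and pill names are illustrative, not real entries), shows one way to tally the extensions present and to filter case-insensitively.  Calling `.copy()` on the filtered result is a common way to get an independent dataframe, which avoids pandas' `SettingWithCopyWarning` when that dataframe is modified later.

```python
import pandas as pd

# Made-up directory entries for illustration only
toy = pd.DataFrame({
    'Image': ['disc1/images/AAA.JPG', 'disc1/images/BBB.jpg',
              'disc2/images/CCC.PNG', 'disc2/images/DDD.WMV'],
    'Name': ['PILL A', 'PILL A', 'PILL B', 'PILL C'],
})

# Tally file extensions, normalised to lower case
extensions = toy['Image'].str.rsplit('.', n=1).str[-1].str.lower()
print(extensions.value_counts().to_dict())

# Case-insensitive filter; .copy() returns an independent dataframe
toy_jpg = toy[toy['Image'].str.lower().str.endswith('.jpg')].copy()
print(toy_jpg['Image'].tolist())  # ['disc1/images/AAA.JPG', 'disc1/images/BBB.jpg']
```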


In [6]:
# Resetting the index; drop = True discards the old index instead of keeping it as a column.
# Reassigning (rather than using inplace = True on the filtered slice) avoids a SettingWithCopyWarning.
consumer_lookup = consumer_lookup.reset_index(drop = True)

In [7]:
# Removing unnecessary columns
consumer_lookup = consumer_lookup.drop(columns = ['NDC11', 'Part', 'Layout'])


In [8]:
consumer_lookup.head()

Unnamed: 0,Image,Name
0,PillProjectDisc10/images/79U-YY6M1UUR6F127ZMAC...,STRATTERA 10MG
1,PillProjectDisc11/images/7WVFV5H74!ELFNQ_GUH92...,STRATTERA 10MG
2,PillProjectDisc20/images/B4CH0R9B7PEQ6GORRX-8X...,STRATTERA 10MG
3,PillProjectDisc21/images/B5T5HI5XI8X2HSBJL-TGD...,STRATTERA 10MG
4,PillProjectDisc25/images/B8S621VZTSXR4Z4VRHYLT...,STRATTERA 10MG


In [9]:
# Checking for null values
consumer_lookup.isna().sum()

Image    0
Name     0
dtype: int64

In [10]:
# Looking at how many jpg images there are for each pill
consumer_lookup.groupby('Name').count().head()

Name,Image
ABACAVIR and LAMIVUDINE Tablets USP,3
ABILIFY 15MG YELLOW TABS,18
ABILIFY 20MG WHITE TABS,18
ABILIFY 2MG GREEN TABS,16
ABILIFY 30MG PINK ROUND TABS,18


It seems that a single medication has multiple pictures.
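
To see how images are spread across all medications, not just the first few rows, one option is `value_counts` on the `Name` column.  The sketch below uses made-up names and counts rather than the real directory.

```python
import pandas as pd

# Made-up lookup table for illustration only
toy_lookup = pd.DataFrame({
    'Image': ['a.JPG', 'b.JPG', 'c.JPG', 'd.JPG', 'e.JPG', 'f.JPG'],
    'Name': ['PILL A', 'PILL A', 'PILL A', 'PILL B', 'PILL B', 'PILL C'],
})

# Number of images per medication, largest first
images_per_pill = toy_lookup['Name'].value_counts()
print(images_per_pill.to_dict())  # {'PILL A': 3, 'PILL B': 2, 'PILL C': 1}

# Smallest, largest, and median number of images per medication
print(images_per_pill.min(), images_per_pill.max(), images_per_pill.median())  # 1 3 2.0
```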

In [11]:
# Looking at total rows / images in new dataframe
consumer_lookup.shape[0]

59414

If every listed medication has at least one JPEG image, the new `consumer_lookup` dataframe should contain the same number of medications as the original `consumer` dataframe, which is 3,010.  We check that next.

In [12]:
# Looking at the total amount of medications in the dataframe
consumer_lookup['Name'].nunique()

3009

It looks like we lost 1 medication, meaning that one medication has no `.JPG` image in the directory.  We can therefore expect to download a total of 59,414 images across 3,009 medications.  As noted above, each medication also has multiple images, so we have a good amount of potential testing data to work with.
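
One way to identify which medication dropped out is to compare the sets of names before and after filtering.  The sketch below does this on a made-up directory; the names and file names are illustrative only.

```python
import pandas as pd

# Made-up directory: PILL B has no JPG image
full = pd.DataFrame({
    'Image': ['a.JPG', 'b.PNG', 'c.WMV', 'd.JPG'],
    'Name': ['PILL A', 'PILL A', 'PILL B', 'PILL C'],
})
jpg_only = full[full['Image'].str.endswith('.JPG')]

# Names present in the full directory but absent after filtering
missing = sorted(set(full['Name']) - set(jpg_only['Name']))
print(missing)  # ['PILL B']

# The rows for those names show which file types they do have
print(full[full['Name'].isin(missing)]['Image'].tolist())  # ['c.WMV']
```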

---
### 03. Amending Directory

In this section, we look to add an extra column (and remove unnecessary ones after) that will point us to the download link for each image referenced in the `Image` column of the `consumer_lookup` directory.

In [13]:
consumer_lookup.head()

Unnamed: 0,Image,Name
0,PillProjectDisc10/images/79U-YY6M1UUR6F127ZMAC...,STRATTERA 10MG
1,PillProjectDisc11/images/7WVFV5H74!ELFNQ_GUH92...,STRATTERA 10MG
2,PillProjectDisc20/images/B4CH0R9B7PEQ6GORRX-8X...,STRATTERA 10MG
3,PillProjectDisc21/images/B5T5HI5XI8X2HSBJL-TGD...,STRATTERA 10MG
4,PillProjectDisc25/images/B8S621VZTSXR4Z4VRHYLT...,STRATTERA 10MG


In [14]:
# Looking at the full value of an individual Image
consumer_lookup['Image'][0]

'PillProjectDisc10/images/79U-YY6M1UUR6F127ZMACIWPEEXHLB.JPG'

In [15]:
# Instantiating the base url for the source of the images
url = 'ftp://lhcftp.nlm.nih.gov/Open-Access-Datasets/Pills/'

# Printing what the base url + the above Image full value would look like
print(url + consumer_lookup['Image'][0])

ftp://lhcftp.nlm.nih.gov/Open-Access-Datasets/Pills/PillProjectDisc10/images/79U-YY6M1UUR6F127ZMACIWPEEXHLB.JPG


The above is the full download link for the image referenced in the first row of the `consumer_lookup` directory.  Next we will create a column within the directory that will be populated with the full url download link for each image in the directory.  We will also remove the `Image` column after, since we do not need it going forward.

In [16]:
# Create new column by prepending the base url to every Image path (vectorized string concatenation)
consumer_lookup = consumer_lookup.assign(full_url = url + consumer_lookup['Image'])

# Removing Image column
consumer_lookup = consumer_lookup.drop(columns = ['Image'])

# Looking at the result of the last 2 steps
consumer_lookup.head()


Unnamed: 0,Name,full_url
0,STRATTERA 10MG,ftp://lhcftp.nlm.nih.gov/Open-Access-Datasets/...
1,STRATTERA 10MG,ftp://lhcftp.nlm.nih.gov/Open-Access-Datasets/...
2,STRATTERA 10MG,ftp://lhcftp.nlm.nih.gov/Open-Access-Datasets/...
3,STRATTERA 10MG,ftp://lhcftp.nlm.nih.gov/Open-Access-Datasets/...
4,STRATTERA 10MG,ftp://lhcftp.nlm.nih.gov/Open-Access-Datasets/...
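
Since the displayed links are truncated, a few quick sanity checks on the new column can be useful before saving.  The sketch below runs them on made-up rows; the variable names `base_url` and `checks` and the sample paths are assumptions for illustration.

```python
import pandas as pd
from urllib.parse import urlparse

base_url = 'ftp://lhcftp.nlm.nih.gov/Open-Access-Datasets/Pills/'

# Made-up rows for illustration only
toy = pd.DataFrame({
    'Name': ['PILL A', 'PILL A'],
    'full_url': [base_url + 'disc1/images/AAA.JPG',
                 base_url + 'disc2/images/BBB.JPG'],
})

# Pull the scheme and host out of every link
schemes = toy['full_url'].map(lambda u: urlparse(u).scheme)
hosts = toy['full_url'].map(lambda u: urlparse(u).netloc)

checks = {
    'all_ftp': bool((schemes == 'ftp').all()),                    # every link uses the ftp scheme
    'one_host': bool(hosts.nunique() == 1),                       # every link points at the same server
    'all_jpg': bool(toy['full_url'].str.endswith('.JPG').all()),  # every link ends in .JPG
    'no_duplicates': bool(toy['full_url'].is_unique),             # no link appears twice
}
print(checks)
```

On the made-up rows every check comes back `True`; a `False` on real data would point to a malformed or duplicated link.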


---
### 04. Saving Down Changes

In this final section, we save `consumer_lookup` as a csv file so that it can be loaded in the next notebook, which will download the respective images.

In [17]:
consumer_lookup.to_csv('../data/consumer_lookup.csv')

In the following notebook, we will use the saved file to download each individual image to our local drive.
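
Note that `to_csv` writes the dataframe index as an unnamed first column by default.  The sketch below shows what that means when the file is read back, using an in-memory buffer and made-up rows (the `example.invalid` links are placeholders) instead of the real file.

```python
import io
import pandas as pd

# Made-up rows for illustration only
toy_lookup = pd.DataFrame({
    'Name': ['PILL A', 'PILL B'],
    'full_url': ['ftp://example.invalid/a.JPG', 'ftp://example.invalid/b.JPG'],
})

# With no path, to_csv returns the CSV text; the index is written as an unnamed first column
csv_text = toy_lookup.to_csv()

# Read naively: the saved index comes back as a column called 'Unnamed: 0'
naive = pd.read_csv(io.StringIO(csv_text))
print(naive.columns.tolist())     # ['Unnamed: 0', 'Name', 'full_url']

# Read with index_col=0: the saved index is restored as the index
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(restored.columns.tolist())  # ['Name', 'full_url']
```

Reading with `index_col=0`, or alternatively saving with `index=False`, keeps the extra column out of the downstream dataframe.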

---