In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from pylab import rcParams
rcParams['figure.figsize'] = 10, 10
import json

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('stopwords')
stop_words = stopwords.words('english')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Chiga\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Chiga\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

## File descriptions
- train.csv - Tabular/text data for the training set
- test.csv - Tabular/text data for the test set
- sample_submission.csv - A sample submission file in the correct format
- breed_labels.csv - Contains Type, and BreedName for each BreedID. Type 1 is dog, 2 is cat.
- color_labels.csv - Contains ColorName for each ColorID
- state_labels.csv - Contains StateName for each StateID

## Data Fields
- **PetID** - Unique hash ID of pet profile
- **Type** - Type of animal (1 = Dog, 2 = Cat)
- **Name** - Name of pet (Empty if not named)
- **Age** - Age of pet when listed, in months
- **Breed1** - Primary breed of pet (Refer to BreedLabels dictionary)
- **Breed2** - Secondary breed of pet, if pet is of mixed breed (Refer to BreedLabels dictionary)
- **Gender** - Gender of pet (1 = Male, 2 = Female, 3 = Mixed, if profile represents group of pets)
- **Color1** - Color 1 of pet (Refer to ColorLabels dictionary)
- **Color2** - Color 2 of pet (Refer to ColorLabels dictionary)
- **Color3** - Color 3 of pet (Refer to ColorLabels dictionary)
- **MaturitySize** - Size at maturity (1 = Small, 2 = Medium, 3 = Large, 4 = Extra Large, 0 = Not Specified)
- **FurLength** - Fur length (1 = Short, 2 = Medium, 3 = Long, 0 = Not Specified)
- **Vaccinated** - Pet has been vaccinated (1 = Yes, 2 = No, 3 = Not Sure)
- **Dewormed** - Pet has been dewormed (1 = Yes, 2 = No, 3 = Not Sure)
- **Sterilized** - Pet has been spayed / neutered (1 = Yes, 2 = No, 3 = Not Sure)
- **Health** - Health Condition (1 = Healthy, 2 = Minor Injury, 3 = Serious Injury, 0 = Not Specified)
- **Quantity** - Number of pets represented in profile
- **Fee** - Adoption fee (0 = Free)
- **State** - State location in Malaysia (Refer to StateLabels dictionary)
- **RescuerID** - Unique hash ID of rescuer
- **VideoAmt** - Total uploaded videos for this pet
- **PhotoAmt** - Total uploaded photos for this pet
- **Description** - Profile write-up for this pet. The primary language used is English, with some in Malay or Chinese.
- **AdoptionSpeed** - Categorical speed of adoption. Lower is faster. Contestants are required to predict this value. The value is determined by how quickly, if at all, a pet is adopted. The values are determined in the following way: 
    - 0 - Pet was adopted on the same day as it was listed. 
    - 1 - Pet was adopted between 1 and 7 days (1st week) after being listed. 
    - 2 - Pet was adopted between 8 and 30 days (1st month) after being listed. 
    - 3 - Pet was adopted between 31 and 90 days (2nd & 3rd month) after being listed. 
    - 4 - No adoption after 100 days of being listed. (There are no pets in this dataset that waited between 90 and 100 days).
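
The binning rules above can be sketched as a small function. This is only an illustration: the dataset ships the final category and not the number of days, so `days_to_adoption` is a hypothetical input, and treating `None` (never adopted) or 100+ days as category 4 is an assumption about how the rule was applied.

```python
def adoption_speed(days_to_adoption):
    """Map days between listing and adoption to an AdoptionSpeed category.

    `days_to_adoption` is a hypothetical input; None means never adopted.
    """
    if days_to_adoption is None or days_to_adoption >= 100:
        return 4  # assumed: not adopted within 100 days
    if days_to_adoption == 0:
        return 0  # same day
    if days_to_adoption <= 7:
        return 1  # first week
    if days_to_adoption <= 30:
        return 2  # first month
    if days_to_adoption <= 90:
        return 3  # second and third month
    # 91-99 days: the description states no pets fall in this gap
    raise ValueError('no category is defined for 91-99 days')
```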

## Images
For pets that have photos, they will be named in the format of PetID-ImageNumber.jpg. Image 1 is the profile (default) photo set for the pet. For privacy purposes, faces, phone numbers and emails have been masked.

## Image Metadata
We have run the images through Google's Vision API, providing analysis on Face Annotation, Label Annotation, Text Annotation and Image Properties. You may optionally utilize this supplementary information for your image analysis.

File name format is PetID-ImageNumber.json.

Some properties will not exist in JSON file if not present, i.e. Face Annotation. Text Annotation has been simplified to just 1 entry of the entire text description (instead of the detailed JSON result broken down by individual characters and words). Phone numbers and emails are already anonymized in Text Annotation.

Google Vision API reference: https://cloud.google.com/vision/docs/reference/rest/v1/images/annotate

## Sentiment Data
We have run each pet profile's description through Google's Natural Language API, providing analysis on sentiment and key entities. You may optionally utilize this supplementary information for your pet description analysis. There are some descriptions that the API could not analyze. As such, there are fewer sentiment files than there are rows in the dataset.

File name format is PetID.json.

Google Natural Language API reference: https://cloud.google.com/natural-language/docs/basics

In [4]:
# Explore CSV files
breed_labels = pd.read_csv('../Data/breed_labels.csv', header = 0)
color_labels = pd.read_csv('../Data/color_labels.csv', header = 0)
state_labels = pd.read_csv('../Data/state_labels.csv', header = 0)

In [5]:
breed_labels.head()

Unnamed: 0,BreedID,Type,BreedName
0,1,1,Affenpinscher
1,2,1,Afghan Hound
2,3,1,Airedale Terrier
3,4,1,Akbash
4,5,1,Akita


In [6]:
breed_labels.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307 entries, 0 to 306
Data columns (total 3 columns):
BreedID      307 non-null int64
Type         307 non-null int64
BreedName    307 non-null object
dtypes: int64(2), object(1)
memory usage: 7.3+ KB


In [7]:
color_labels

Unnamed: 0,ColorID,ColorName
0,1,Black
1,2,Brown
2,3,Golden
3,4,Yellow
4,5,Cream
5,6,Gray
6,7,White


In [8]:
state_labels

Unnamed: 0,StateID,StateName
0,41336,Johor
1,41325,Kedah
2,41367,Kelantan
3,41401,Kuala Lumpur
4,41415,Labuan
5,41324,Melaka
6,41332,Negeri Sembilan
7,41335,Pahang
8,41330,Perak
9,41380,Perlis


The label tables above look unremarkable, except that the ``StateID`` values in the *state_labels* table don't start at 1 and aren't sequential. It might be necessary to standardize the ID numbers later in the analysis.
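
One possible way to standardize the IDs is to map each ``StateID`` to a sequential code. The sketch below uses a small toy table in place of *state_labels*; the names `state_labels_demo`, `listings_demo`, and `state_code` are made up for illustration.

```python
import pandas as pd

# Toy stand-in for state_labels; the real table is read from state_labels.csv
state_labels_demo = pd.DataFrame({
    'StateID': [41336, 41325, 41367, 41401],
    'StateName': ['Johor', 'Kedah', 'Kelantan', 'Kuala Lumpur'],
})

# Map each StateID to a sequential code starting at 1, in table order
state_code = {sid: i for i, sid in enumerate(state_labels_demo['StateID'], start=1)}

# Toy listings that reference states by their original ID
listings_demo = pd.DataFrame({'State': [41401, 41336, 41401]})
listings_demo['state_code'] = listings_demo['State'].map(state_code)
```

Since the state is a nominal category, the sequential codes carry no ordering, so a model that treats them as numbers may still need one-hot encoding.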

# Training Set

In [9]:
# Explore train.zip
train = pd.read_csv('../Data/train.zip', compression = 'zip', header = 0)
train.head()

Unnamed: 0,Type,Name,Age,Breed1,Breed2,Gender,Color1,Color2,Color3,MaturitySize,...,Health,Quantity,Fee,State,RescuerID,VideoAmt,Description,PetID,PhotoAmt,AdoptionSpeed
0,2,Nibble,3,299,0,1,1,7,0,1,...,1,1,100,41326,8480853f516546f6cf33aa88cd76c379,0,Nibble is a 3+ month old ball of cuteness. He ...,86e1089a3,1.0,2
1,2,No Name Yet,1,265,0,1,1,2,0,2,...,1,1,0,41401,3082c7125d8fb66f7dd4bff4192c8b14,0,I just found it alone yesterday near my apartm...,6296e909a,2.0,0
2,1,Brisco,1,307,0,1,2,7,0,2,...,1,1,0,41326,fa90fa5b1ee11c86938398b60abc32cb,0,Their pregnant mother was dumped by her irresp...,3422e4906,7.0,3
3,1,Miko,4,307,0,2,1,2,0,2,...,1,1,150,41401,9238e4f44c71a75282e62f7136c6b240,0,"Good guard dog, very alert, active, obedience ...",5842f1ff5,8.0,2
4,1,Hunter,1,307,0,1,1,0,0,2,...,1,1,0,41326,95481e953f8aed9ec3d16fc4509537e8,0,This handsome yet cute boy is up for adoption....,850a43f90,3.0,2


In [10]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14993 entries, 0 to 14992
Data columns (total 24 columns):
Type             14993 non-null int64
Name             13736 non-null object
Age              14993 non-null int64
Breed1           14993 non-null int64
Breed2           14993 non-null int64
Gender           14993 non-null int64
Color1           14993 non-null int64
Color2           14993 non-null int64
Color3           14993 non-null int64
MaturitySize     14993 non-null int64
FurLength        14993 non-null int64
Vaccinated       14993 non-null int64
Dewormed         14993 non-null int64
Sterilized       14993 non-null int64
Health           14993 non-null int64
Quantity         14993 non-null int64
Fee              14993 non-null int64
State            14993 non-null int64
RescuerID        14993 non-null object
VideoAmt         14993 non-null int64
Description      14981 non-null object
PetID            14993 non-null object
PhotoAmt         14993 non-null float64
AdoptionSpe

In [11]:
train.describe()

Unnamed: 0,Type,Age,Breed1,Breed2,Gender,Color1,Color2,Color3,MaturitySize,FurLength,Vaccinated,Dewormed,Sterilized,Health,Quantity,Fee,State,VideoAmt,PhotoAmt,AdoptionSpeed
count,14993.0,14993.0,14993.0,14993.0,14993.0,14993.0,14993.0,14993.0,14993.0,14993.0,14993.0,14993.0,14993.0,14993.0,14993.0,14993.0,14993.0,14993.0,14993.0,14993.0
mean,1.457614,10.452078,265.272594,74.009738,1.776162,2.234176,3.222837,1.882012,1.862002,1.467485,1.731208,1.558727,1.914227,1.036617,1.576069,21.259988,41346.028347,0.05676,3.889215,2.516441
std,0.498217,18.15579,60.056818,123.011575,0.681592,1.745225,2.742562,2.984086,0.547959,0.59907,0.667649,0.695817,0.566172,0.199535,1.472477,78.414548,32.444153,0.346185,3.48781,1.177265
min,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,41324.0,0.0,0.0,0.0
25%,1.0,2.0,265.0,0.0,1.0,1.0,0.0,0.0,2.0,1.0,1.0,1.0,2.0,1.0,1.0,0.0,41326.0,0.0,2.0,2.0
50%,1.0,3.0,266.0,0.0,2.0,2.0,2.0,0.0,2.0,1.0,2.0,1.0,2.0,1.0,1.0,0.0,41326.0,0.0,3.0,2.0
75%,2.0,12.0,307.0,179.0,2.0,3.0,6.0,5.0,2.0,2.0,2.0,2.0,2.0,1.0,1.0,0.0,41401.0,0.0,5.0,4.0
max,2.0,255.0,307.0,307.0,3.0,7.0,7.0,7.0,4.0,3.0,3.0,3.0,3.0,3.0,20.0,3000.0,41415.0,8.0,30.0,4.0


In the training data, only the ``Name`` and ``Description`` columns seem to have missing values (1,257 and 12 rows respectively). That doesn't pose too much of a problem since names can always change and there are other ways to learn about a pet, e.g. photos. Some text processing might be needed to fill in the ``Description`` column. First, we'll explore the ``Name`` column. A glance at the data set shows that some names were entered as *No Name Yet*, which should be treated as a *null* value. That raises the question of whether there are other ways of stating that a pet has no name, e.g. *No Name*, *Unnamed*, or *Unknown*.

Other files in the training set should also be looked at, namely the *metadata* and *sentiment* data. They provide context for each pet's images and description, respectively. Both sets of data are stored as JSON files named by ``PetID``.

In [9]:
# Open random metadata file for exploration
with open('../Data/train_metadata/000a290e4-1.json', 'r') as json_file:
    json_data = json.load(json_file)

json_data

{'labelAnnotations': [{'mid': '/m/0bt9lr',
   'description': 'dog',
   'score': 0.96414083,
   'topicality': 0.96414083},
  {'mid': '/m/0kpmf',
   'description': 'dog breed',
   'score': 0.9419755,
   'topicality': 0.9419755},
  {'mid': '/m/01z5f',
   'description': 'dog like mammal',
   'score': 0.92154,
   'topicality': 0.92154},
  {'mid': '/m/02xl47d',
   'description': 'dog breed group',
   'score': 0.8994595,
   'topicality': 0.8994595},
  {'mid': '/m/0393qn',
   'description': 'phalène',
   'score': 0.71789825,
   'topicality': 0.71789825},
  {'mid': '/m/01lrl',
   'description': 'carnivoran',
   'score': 0.7058321,
   'topicality': 0.7058321},
  {'mid': '/m/01pkw7',
   'description': 'papillon',
   'score': 0.6653916,
   'topicality': 0.6653916},
  {'mid': '/m/03yl64',
   'description': 'companion dog',
   'score': 0.6042771,
   'topicality': 0.6042771},
  {'mid': '/m/0fxnkq',
   'description': 'moscow watchdog',
   'score': 0.6030931,
   'topicality': 0.6030931},
  {'mid': '/m

Judging by the contents of the metadata, there doesn't seem to be much information to extract that isn't already in the train set. These JSON files appear to contain Google Vision's analysis of which types of animals are in the image and which colors are present, both of which are already in the train data.
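
To check that impression on more files, the high-confidence labels could be pulled out of each metadata dictionary and compared with the ``Type`` column. The helper below is a minimal sketch; `top_labels` and the 0.9 threshold are hypothetical, and the sample dictionary is a shortened copy of the output above.

```python
def top_labels(metadata, min_score=0.9):
    """Return label descriptions whose score is at least `min_score`."""
    # labelAnnotations may be absent when the API found nothing to label
    annotations = metadata.get('labelAnnotations', [])
    return [a['description'] for a in annotations if a.get('score', 0) >= min_score]

# Shortened copy of the metadata shown above
sample_metadata = {
    'labelAnnotations': [
        {'mid': '/m/0bt9lr', 'description': 'dog', 'score': 0.96414083},
        {'mid': '/m/0kpmf', 'description': 'dog breed', 'score': 0.9419755},
        {'mid': '/m/01pkw7', 'description': 'papillon', 'score': 0.6653916},
    ]
}

labels = top_labels(sample_metadata)
```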

In [10]:
# Open random sentiment file for exploration
with open('../Data/train_sentiment/000a290e4.json', 'r') as json_file:
    json_data = json.load(json_file)

json_data

{'sentences': [{'text': {'content': 'went to teluk kumba kuanthai restaurant saw this female puppies alone by the beach..',
    'beginOffset': -1},
   'sentiment': {'magnitude': 0.1, 'score': 0.1}},
  {'text': {'content': 'Adopters must vaccinate, spay and keep puppy indoors/fenced Call/WhatsApp: Address: teluk kumba',
    'beginOffset': -1},
   'sentiment': {'magnitude': 0.5, 'score': 0.5}}],
 'tokens': [],
 'entities': [{'name': 'restaurant',
   'type': 'LOCATION',
   'metadata': {},
   'salience': 0.26085824,
   'mentions': [{'text': {'content': 'restaurant', 'beginOffset': -1},
     'type': 'COMMON'}]},
  {'name': 'puppies',
   'type': 'OTHER',
   'metadata': {},
   'salience': 0.20370758,
   'mentions': [{'text': {'content': 'puppies', 'beginOffset': -1},
     'type': 'COMMON'}]},
  {'name': 'beach',
   'type': 'LOCATION',
   'metadata': {},
   'salience': 0.18226475,
   'mentions': [{'text': {'content': 'beach', 'beginOffset': -1},
     'type': 'COMMON'}]},
  {'name': 'Call',
   

From the sentiment files, it might be useful to extract the *documentSentiment* ``magnitude`` and ``score`` as numeric features that describe the pet's situation.

In [15]:
def add_sentiment(df, sentiment_folder):
    '''Extract the sentiment magnitude and score from each pet's
    corresponding JSON file and append them to the given data frame.'''
    
    # Check that sentiment_folder is a string
    if not isinstance(sentiment_folder, str):
        raise TypeError('sentiment_folder must be of type str')
    
    # Extract each PetID from the data frame
    # This should be the same length as the data frame
    pet_ids = df['PetID'].unique()
    
    if len(pet_ids) != len(df):
        raise Exception('Number of unique PetID not equal to length of dataframe')
    
    # Initialize an empty data frame
    sentiment = pd.DataFrame(index = pet_ids, columns = ['des_sent_mag', 'des_sent_score'])
    
    # Extract sentiment magnitude and score from sentiment files
    for pet_id in pet_ids:
        try:
            with open('../Data/' + sentiment_folder + '/' + pet_id + '.json', 'r') as json_file:
                json_data = json.load(json_file)
                sentiment.loc[pet_id, 'des_sent_mag'] = json_data['documentSentiment']['magnitude']
                sentiment.loc[pet_id, 'des_sent_score'] = json_data['documentSentiment']['score']
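        except (FileNotFoundError, KeyError, json.JSONDecodeError):
            # No usable sentiment file for this pet; leave its values missing
            continue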
    
    # Fill missing values with 0 to represent neutrality and store as floats
    sentiment = sentiment.fillna(0).astype(float)
    
    # Append sentiment to df
    df_sentiment = df.merge(sentiment, how='left', left_on='PetID', right_index=True)
    
    # Return df_sentiment
    return df_sentiment

In [16]:
# Append sentiment data to train using add_sentiment
train_sentiment = add_sentiment(train, 'train_sentiment')

In [17]:
train_sentiment.head()

Unnamed: 0,Type,Name,Age,Breed1,Breed2,Gender,Color1,Color2,Color3,MaturitySize,...,Fee,State,RescuerID,VideoAmt,Description,PetID,PhotoAmt,AdoptionSpeed,des_sent_mag,des_sent_score
0,2,Nibble,3,299,0,1,1,7,0,1,...,100,41326,8480853f516546f6cf33aa88cd76c379,0,Nibble is a 3+ month old ball of cuteness. He ...,86e1089a3,1.0,2,2.4,0.3
1,2,No Name Yet,1,265,0,1,1,2,0,2,...,0,41401,3082c7125d8fb66f7dd4bff4192c8b14,0,I just found it alone yesterday near my apartm...,6296e909a,2.0,0,0.7,-0.2
2,1,Brisco,1,307,0,1,2,7,0,2,...,0,41326,fa90fa5b1ee11c86938398b60abc32cb,0,Their pregnant mother was dumped by her irresp...,3422e4906,7.0,3,3.7,0.2
3,1,Miko,4,307,0,2,1,2,0,2,...,150,41401,9238e4f44c71a75282e62f7136c6b240,0,"Good guard dog, very alert, active, obedience ...",5842f1ff5,8.0,2,0.9,0.9
4,1,Hunter,1,307,0,1,1,0,0,2,...,0,41326,95481e953f8aed9ec3d16fc4509537e8,0,This handsome yet cute boy is up for adoption....,850a43f90,3.0,2,3.7,0.6


## ``Name``

In [None]:
# Convert all letters of names to lower case
train_sentiment['Name'] = train_sentiment['Name'].str.lower()

In [None]:
train_names = train_sentiment[~train_sentiment['Name'].isnull()]

In [None]:
train_unnamed = train_names[train_names['Name'].str.contains('name')]

In [None]:
train_unnamed.head(10)

In [None]:
train_name_unknown = train_names[train_names['Name'].str.contains('unknown')]
train_name_unknown.head()

As shown in the two new data frames above, *train_unnamed* and *train_name_unknown*, a number of listings record, in different ways, that the pet(s) don't actually have a name. However, as stated above, the names themselves might not matter since they can always be changed, so a column indicating whether a pet has a name could be more useful.

In [None]:
def add_named(df):
    '''Add a column called "named" to the specified data frame. It is 0 if
    "Name" is missing or contains "name" or "unknown", and 1 otherwise.'''
    
    # Add dummy column for named pets
    df['named'] = 1
    # Assign the result back instead of calling fillna(inplace=True) on the column
    df['Name'] = df['Name'].fillna('no name yet')

    # Set named to 0 if pet is unnamed (case-insensitive match)
    df.loc[df['Name'].str.contains('name', case=False), 'named'] = 0
    df.loc[df['Name'].str.contains('unknown', case=False), 'named'] = 0
    
    return df

In [None]:
# Apply add_named to train
train_named = add_named(train_sentiment)

The function above has a couple of edge cases. First, if a genuine pet's name contains "name", the pet will be categorized as not having a name. The second has to do with the ``Quantity`` column: if the ``Quantity`` is greater than 1 and one pet isn't named, then the whole listing will be considered unnamed. I'm going to assume that both cases are uncommon and can be ignored, but the ``Quantity`` column is something to watch moving forward.
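
If the first edge case turned out to matter, one alternative would be to match whole placeholder strings instead of substrings. The sketch below is not the approach used in this notebook; `PLACEHOLDERS` is a hypothetical list that would need to be extended after inspecting the real names, and `has_real_name` is a made-up helper.

```python
import pandas as pd

# Hypothetical placeholder list; extend after inspecting the real names
PLACEHOLDERS = {'no name', 'no name yet', 'unnamed', 'unknown', 'not named'}

def has_real_name(names):
    """Return 1 where the name is present and not a known placeholder, else 0."""
    cleaned = names.fillna('').str.strip().str.lower()
    return (~cleaned.isin(PLACEHOLDERS | {''})).astype(int)

names_demo = pd.Series(['Nibble', 'No Name Yet', None, 'Nameless Wonder', 'unknown'])
named_demo = has_real_name(names_demo)
```

With exact matching, a name such as "Nameless Wonder" is kept as a real name, at the cost of missing placeholders that aren't in the list.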

## ``Quantity``

The ``Quantity`` column is a tricky variable to handle since it could affect how the other columns are filled in. For example, all of the pets' names are entered into the single ``Name`` column. If a listing is a group of different breeds, are they entered in ``Breed1`` and ``Breed2``? What about color and the other columns?

In [None]:
# Filter Quantity greater than 1
multiple_pets = train_named[train_named['Quantity'] > 1]

print(len(multiple_pets))
multiple_pets.head()

In [None]:
# Filter Breed2 > 0 from multiple_pets
mixed_multiple_pets = multiple_pets[multiple_pets['Breed2'] > 0]

print(len(mixed_multiple_pets))
mixed_multiple_pets.head()

Given the number of listings with multiple pets, they shouldn't be removed from further analysis, but they still pose an issue. The pictures/videos could be analyzed to split each pet into its own separate listing with its own features. However, that would be a completely different project in itself, as an algorithm would have to be developed to identify the species, breed, color, and gender of each animal, which the Google Vision API could help with. For now, to simplify the analysis, it'll be assumed that the pets in a listing with a ``Quantity`` greater than 1 have similar features (breed and color).

## ``Breed1`` & ``Breed2``

Speaking of simplifying the analysis, the two columns ``Breed1`` and ``Breed2`` could be condensed into one by adding a column that indicates whether a pet is purebred or mixed.

In [12]:
# Investigate observations with Breed1 == 0
breed1_missing = train_named[train_named['Breed1'] == 0]
breed1_missing.describe()

Unnamed: 0,Type,Age,Breed1,Breed2,Gender,Color1,Color2,Color3,MaturitySize,FurLength,Vaccinated,Dewormed,Sterilized,Health,Quantity,Fee,State,VideoAmt,PhotoAmt,AdoptionSpeed
count,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0
mean,1.2,16.6,0.0,222.2,2.0,1.6,2.8,2.8,1.6,1.2,1.6,1.4,2.0,1.0,1.4,20.2,41343.0,0.0,4.8,3.2
std,0.447214,30.980639,0.0,117.357147,0.707107,0.547723,1.923538,3.834058,0.547723,0.447214,0.547723,0.547723,0.0,0.0,0.894427,44.611658,32.710854,0.0,4.024922,0.83666
min,1.0,2.0,0.0,26.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,0.0,41326.0,0.0,0.0,2.0
25%,1.0,2.0,0.0,205.0,2.0,1.0,2.0,0.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,0.0,41326.0,0.0,3.0,3.0
50%,1.0,3.0,0.0,266.0,2.0,2.0,3.0,0.0,2.0,1.0,2.0,1.0,2.0,1.0,1.0,0.0,41326.0,0.0,5.0,3.0
75%,1.0,4.0,0.0,307.0,2.0,2.0,4.0,7.0,2.0,1.0,2.0,2.0,2.0,1.0,1.0,1.0,41336.0,0.0,5.0,4.0
max,2.0,72.0,0.0,307.0,3.0,2.0,5.0,7.0,2.0,2.0,2.0,2.0,2.0,1.0,3.0,100.0,41401.0,0.0,11.0,4.0


It appears that in the 5 instances where ``Breed1`` is missing, the breed was perhaps entered as ``Breed2`` instead. Regardless, a pet will be considered a mixed breed only if both ``Breed1`` and ``Breed2`` are non-zero.
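
If those rows needed repairing instead, the ``Breed2`` value could be moved into the primary slot. The sketch below uses a toy frame and assumes that a ``Breed1`` of 0 means "missing"; `breeds_demo`, `primary`, and `secondary` are illustrative names only.

```python
import pandas as pd

# Toy frame: the first row mimics a listing with Breed1 missing
breeds_demo = pd.DataFrame({'Breed1': [0, 307, 265], 'Breed2': [266, 0, 299]})

has_breed1 = breeds_demo['Breed1'] != 0

# Use Breed2 as the primary breed when Breed1 is 0 (assumed to mean "missing")
primary = breeds_demo['Breed1'].where(has_breed1, breeds_demo['Breed2'])

# Keep Breed2 as the secondary breed only when Breed1 was given
secondary = breeds_demo['Breed2'].where(has_breed1, 0)
```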

In [None]:
def add_mixed_breed(df):
    '''This function adds the column mixed_breed to the given data frame'''
    # Add and initialize mixed_breed column
    df['mixed_breed'] = 1
    
    # Set mixed_breed = 0 if Breed1 or Breed2 == 0
    df.loc[(df['Breed1'] == 0) | (df['Breed2'] == 0), 'mixed_breed'] = 0
    
    return df

In [None]:
train_mixed_breed = add_mixed_breed(train_named)

## ``Color1``, ``Color2``, ``Color3``

The three color columns get the same treatment as the two breed columns: a single indicator column, here for whether a pet has more than one color.

In [None]:
def add_mixed_color(df):
    '''This function adds the mixed_color column to the given data frame based on Color1, Color2, and 
    Color3'''
    df['mixed_color'] = 0
    
    # Set mixed_color = 1 if Color2 or Color3 is present
    df.loc[(df['Color2'] > 0) | (df['Color3'] > 0), 'mixed_color'] = 1
    
    return df
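
A variation on the function above would count the colors instead of only flagging them. This is a sketch on a toy frame, assuming a color ID of 0 means no color was recorded in that slot; `colors_demo` and `n_colors` are made-up names.

```python
import pandas as pd

# Toy frame with one, two, and three recorded colors
colors_demo = pd.DataFrame({
    'Color1': [1, 2, 1],
    'Color2': [7, 0, 2],
    'Color3': [0, 0, 7],
})

color_cols = ['Color1', 'Color2', 'Color3']

# A color ID of 0 is assumed to mean "no color recorded in this slot"
colors_demo['n_colors'] = (colors_demo[color_cols] > 0).sum(axis=1)

# Same flag as add_mixed_color as long as Color1 is always present
colors_demo['mixed_color'] = (colors_demo['n_colors'] > 1).astype(int)
```

In the training data the minimum of ``Color1`` is 1, so the two definitions of ``mixed_color`` should agree there.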

# Test Set

In [None]:
test = pd.read_csv('../Data/test.csv', header = 0)
test.head()

In [None]:
test.info()

As with the training set, a few values in the ``Name`` and ``Description`` columns are null. Again, this shouldn't pose too much of a problem.
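
The null counts can also be read off directly, without scanning the ``info()`` output. The sketch below runs on a toy frame standing in for the test set; `test_demo` and its values are invented for illustration.

```python
import pandas as pd

# Toy stand-in for the test set
test_demo = pd.DataFrame({
    'Name': ['Puppy', None, 'Kitty'],
    'Description': ['Friendly dog', 'Found near the market', None],
    'Age': [2, 4, 1],
})

# Count missing values per column and keep only columns that have any
null_counts = test_demo.isna().sum()
cols_with_nulls = null_counts[null_counts > 0]
```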