## Overview of the various datasets used in the analysis

The analysis uses five datasets, listed below.


1. Original dataframe: This is the dataframe created by the other researchers (1), which I used as the starting point for my analysis. It is identical to the 'records.sql' file used in prep_data.py.

2. Other researchers' predictor matrix: The predictor matrix the other researchers used for their algorithm's predictions. It contains an intercept column; with that column excluded, the sparsity level is around 82%.

3. My feature set: The features (attributes) I selected from the original dataframe to work with.

4. My predictor matrix: Integer-encoded tokens that are passed to the first layer of the neural network, the embedding layer.

5. My embeddings: The output of the first layer of the net, the embedding layer. In this analysis the embeddings are allowed to evolve as the net trains against a single loss function, which captures the error in predicting whether a pet from the test dataset will be adopted. It is possible to freeze the embeddings if desired. It is also possible to start the net from embeddings derived by other models, e.g. Google's word2vec, Stanford's GloVe, or Google's BERT WordPiece embeddings.
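
To make items 4 and 5 concrete, here is a minimal sketch of what an embedding layer does with integer-encoded tokens: it looks up one row of a weight table per token id. The vocabulary size, embedding width, and token ids below are hypothetical, and plain NumPy stands in for whichever framework the net actually uses.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 10, 4  # hypothetical sizes, not taken from the analysis

# Small random initial weights; a trainable layer would update these during training.
weights = rng.uniform(-0.05, 0.05, size=(vocab_size, embedding_dim))

# Two integer-encoded rows, shape (batch, sequence_length); 0 is assumed to be padding.
token_ids = np.array([[3, 1, 7, 0, 0],
                      [2, 2, 5, 9, 0]])

# The lookup itself: each id selects one row of `weights`.
embedded = weights[token_ids]  # shape (batch, sequence_length, embedding_dim)
```

Every occurrence of the same id maps to the same vector, so positions holding the same token produce identical rows. Freezing the embeddings would mean holding `weights` fixed during training, and starting from pretrained vectors would mean filling `weights` from those vectors instead of random values; how either is configured depends on the framework.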

#### Sept/20

(1) Researchers' report (login required): https://data.world/rdowns26/austin-animal-shelter/workspace/file?filename=Project+Report.pdf


In [1]:
# Import core dependencies.
import numpy as np
import pandas as pd 

### 1. Original dataframe.

In [2]:
original_dataframe = pd.read_csv('.....pets.csv', index_col=0)
original_dataframe.head()

| Unnamed: 0 | Animal ID | Name_intake | DateTime_intake | MonthYear_intake | Found_Location | Intake_Type | IntakeCondition | Animal_Type_intake | Sex | Age | ... | DateTime_outcome | MonthYear_outcome | DOB | Outcome_Type | Outcome_Subtype | Animal_Type_outcome | Sex_upon_Outcome | Age_upon_Outcome | Breed_outcome | Color_outcome |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A730601 |  | 07/07/2016 12:11:00 PM | 07/07/2016 12:11:00 PM | 1109 Shady Ln in Austin (TX) | Stray | Normal | Cat | Intact Male | 7 months | ... | 07/08/2016 09:00:00 AM | 07/08/2016 09:00:00 AM | 12/07/2015 | Transfer | SCRP | Cat | Neutered Male | 7 months | Domestic Shorthair Mix | Blue Tabby |
| 1 | A683644 | *Zoey | 07/13/2014 11:02:00 AM | 07/13/2014 11:02:00 AM | Austin (TX) | Owner Surrender | Nursing | Dog | Intact Female | 4 weeks | ... | 11/06/2014 10:06:00 AM | 11/06/2014 10:06:00 AM | 06/13/2014 | Adoption | Foster | Dog | Spayed Female | 4 months | Border Collie Mix | Brown/White |
| 2 | A676515 | Rico | 04/11/2014 08:45:00 AM | 04/11/2014 08:45:00 AM | 615 E. Wonsley in Austin (TX) | Stray | Normal | Dog | Intact Male | 2 months | ... | 04/14/2014 06:38:00 PM | 04/14/2014 06:38:00 PM | 01/11/2014 | Return to Owner |  | Dog | Neutered Male | 3 months | Pit Bull Mix | White/Brown |
| 3 | A742953 |  | 01/31/2017 01:30:00 PM | 01/31/2017 01:30:00 PM | S Hwy 183 And Thompson Lane in Austin (TX) | Stray | Normal | Dog | Intact Male | 2 years | ... | 02/04/2017 02:17:00 PM | 02/04/2017 02:17:00 PM | 01/31/2015 | Transfer | Partner | Dog | Intact Male | 2 years | Saluki | Sable/Cream |
| 4 | A679549 | *Gilbert | 05/22/2014 03:43:00 PM | 05/22/2014 03:43:00 PM | 124 W Anderson in Austin (TX) | Stray | Normal | Cat | Intact Male | 1 month | ... | 06/16/2014 01:54:00 PM | 06/16/2014 01:54:00 PM | 03/31/2014 | Transfer | Partner | Cat | Neutered Male | 2 months | Domestic Shorthair Mix | Black/White |


### 2. Other researchers' predictor matrix.

In [3]:
other_predictor_matrix = pd.read_csv('....other_predictor_matrix.csv', index_col=0)
other_predictor_matrix.head()

| Unnamed: 0 | Intercept | Gender[T.Male] | Age_Bucket[T.1-6 months] | Age_Bucket[T.1-6 weeks] | Age_Bucket[T.4-6 years] | Age_Bucket[T.7+ years] | Age_Bucket[T.7-12 months] | Age_Bucket[T.Less than 1 week] | Animal_Type_intake[T.Cat] | Animal_Type_intake[T.Dog] | ... | beagle | terrier | boxer | poodle | rottweiler | dachshund | chihuahua | pit_bull | Have_name_start | Have_name_end |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| 2 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| 3 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
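
The overview quotes a sparsity of around 82% for this matrix once the intercept is excluded. A minimal sketch of how such a figure could be computed follows, assuming "sparsity" means the fraction of zero-valued cells. The `sparsity` helper and the toy frame are hypothetical stand-ins, not the researchers' code or data.

```python
import pandas as pd

def sparsity(frame: pd.DataFrame, exclude=("Intercept",)) -> float:
    """Return the fraction of zero-valued cells, ignoring the excluded columns."""
    kept = frame.drop(columns=[c for c in exclude if c in frame.columns])
    return float((kept == 0).to_numpy().mean())

# Toy stand-in for the real matrix (values are illustrative only).
toy = pd.DataFrame({
    "Intercept": [1.0, 1.0, 1.0, 1.0],
    "Gender[T.Male]": [1.0, 0.0, 1.0, 1.0],
    "beagle": [0.0, 0.0, 0.0, 0.0],
    "Have_name_start": [0.0, 1.0, 1.0, 0.0],
})

toy_sparsity = sparsity(toy)                       # intercept excluded
toy_sparsity_with_intercept = sparsity(toy, exclude=())
```

Because the intercept column is all ones, including it lowers the measured sparsity, which is presumably why the overview quotes the figure with that column left out. On the real data the call would be `sparsity(other_predictor_matrix)`.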


### 3. My feature set.

In [11]:
my_feature_matrix = pd.read_csv('....my_features.csv', index_col=0)
my_feature_matrix.head()

| Unnamed: 0 | Name_intake | Intake_Type | IntakeCondition | Animal_Type_intake | Sex | Age | Breed_intake | Color_intake | Age_upon_Outcome | Days_length | day_intake | month_intake | year_intake |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | missing | Stray | Normal | Cat | Intact Male | 7 months | Domestic Shorthair Mix | Blue Tabby | 7 months | 0-7 days | 7 | 7 | 2016 |
| 1 | *Zoey | Owner Surrender | Nursing | Dog | Intact Female | 4 weeks | Border Collie Mix | Brown/White | 4 months | 12 weeks - 6 months | 13 | 7 | 2014 |
| 2 | Rico | Stray | Normal | Dog | Intact Male | 2 months | Pit Bull Mix | White/Brown | 3 months | 0-7 days | 11 | 4 | 2014 |
| 3 | missing | Stray | Normal | Dog | Intact Male | 2 years | Saluki | Sable/Cream | 2 years | 0-7 days | 31 | 1 | 2017 |
| 4 | *Gilbert | Stray | Normal | Cat | Intact Male | 1 month | Domestic Shorthair Mix | Black/White | 2 months | 3-6 weeks | 22 | 5 | 2014 |
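
Comparing this output with the original dataframe suggests that empty names became the string `missing` and that `day_intake`, `month_intake`, and `year_intake` were split out of `DateTime_intake`. The sketch below shows one way such columns could be derived with pandas; it is an assumption about the preprocessing, not the actual code, and it uses the first two rows shown above as input.

```python
import pandas as pd

# First two rows of the original dataframe, restricted to the columns used here.
raw = pd.DataFrame({
    "Name_intake": [None, "*Zoey"],
    "DateTime_intake": ["07/07/2016 12:11:00 PM", "07/13/2014 11:02:00 AM"],
})

# Parse the 12-hour timestamp format seen in the original data.
intake = pd.to_datetime(raw["DateTime_intake"], format="%m/%d/%Y %I:%M:%S %p")

features = pd.DataFrame({
    "Name_intake": raw["Name_intake"].fillna("missing"),  # placeholder for empty names
    "day_intake": intake.dt.day,
    "month_intake": intake.dt.month,
    "year_intake": intake.dt.year,
})
```

Replacing empty names with an explicit placeholder keeps "no name" available as a category of its own rather than dropping those rows.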


### 4. My predictor matrix.

In [5]:
my_predictor_matrix = pd.read_csv('....my_predictor_matrix.csv', index_col=0)
my_predictor_matrix.head()

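| Unnamed: 0 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3255 | 32 | 33 | 1 | 7 | 37 | 17 | 5 | 2 | 48 | ... | 26 | 2 | 20 | 77 | 30 | 28 | 0 | 0 | 0 | 0 |
| 1 | 607 | 6 | 1 | 7 | 10 | 14 | 8 | 2 | 43 | 42 | ... | 4 | 18 | 20 | 77 | 8 | 28 | 0 | 0 | 0 | 0 |
| 2 | 722 | 6 | 1 | 7 | 10 | 17 | 31 | 12 | 48 | 21 | ... | 15 | 19 | 12 | 4 | 9 | 16 | 5 | 36 | 27 | 0 |
| 3 | 2684 | 6 | 1 | 7 | 34 | 14 | 8 | 2 | 119 | 95 | ... | 2 | 4 | 8 | 2 | 73 | 9 | 25 | 0 | 0 | 0 |
| 4 | 20 | 66 | 57 | 56 | 49 | 4 | 18 | 89 | 3 | 23 | ... | 5 | 11 | 38 | 5 | 25 | 0 | 0 | 0 | 0 | 0 |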
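
Each row above is a fixed-length sequence of integer token ids, and the trailing zeros are consistent with right-padding, although the source does not say so explicitly. The following sketch shows one common way to produce such a matrix; the tokenization, the vocabulary, and the helper names `build_vocab` and `encode` are hypothetical.

```python
def build_vocab(rows):
    """Map each distinct token to a positive integer; 0 is reserved for padding."""
    vocab = {}
    for row in rows:
        for token in row:
            if token not in vocab:
                vocab[token] = len(vocab) + 1
    return vocab

def encode(rows, vocab, length):
    """Turn token lists into equal-length integer rows, right-padded with 0."""
    encoded = []
    for row in rows:
        ids = [vocab[token] for token in row][:length]  # truncate overly long rows
        encoded.append(ids + [0] * (length - len(ids)))
    return encoded

# Toy token lists loosely modeled on the feature set (illustrative only).
rows = [
    ["stray", "normal", "cat", "intact", "male"],
    ["owner", "surrender", "nursing", "dog"],
]
vocab = build_vocab(rows)
matrix = encode(rows, vocab, length=6)
```

Reserving 0 for padding means every padded position is the same token, so an embedding layer maps all of them to the same vector.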


### 5. My embedding matrix.  

In [10]:
# The full (4D) embedding matrix is not uploaded to GitHub because of its size. The top of that matrix is shown below.
embedding_matrix = np.load('....my_embeddings.npy') 
embedding_matrix[0:1]

array([[[[-0.03929623, -0.04125507, -0.02005602, ..., -0.00473398,
           0.04301593,  0.00960034],
         [ 0.04959803,  0.03936771, -0.01722408, ..., -0.00368117,
          -0.01888877,  0.02015176],
         [ 0.02169741, -0.00024492, -0.00335282, ...,  0.0200487 ,
          -0.00496046, -0.00632836],
         ...,
         [-0.01698695,  0.02488183, -0.02087884, ...,  0.01798498,
           0.01846988, -0.00617523],
         [-0.01698695,  0.02488183, -0.02087884, ...,  0.01798498,
           0.01846988, -0.00617523],
         [-0.01698695,  0.02488183, -0.02087884, ...,  0.01798498,
           0.01846988, -0.00617523]],

        [[-0.02905277, -0.00379145,  0.02618489, ...,  0.0463664 ,
          -0.01328101, -0.01057801],
         [ 0.0338946 , -0.03762949, -0.03374013, ..., -0.02760495,
           0.01906364,  0.00089316],
         [ 0.0415062 , -0.02502749,  0.01545482, ..., -0.00990937,
           0.00428069, -0.01977917],
         ...,
         [-0.01698695,  0.02488183