[View in Colaboratory](https://colab.research.google.com/github/cxl923cc/Avito/blob/master/Avito_Image_Feature_Engineering_for_test_set.ipynb)

# Avito Demand Prediction

**Step 1 - Preparation:**
  * Import the libraries. Define the threshold of image recognisability
  * Set up Kaggle API
  * Download image data and train master (zipped file), unzip the files
  * Download Resnet50 from keras library
  * Download test data and unzip the file

In [2]:
import os

import numpy as np
import pandas as pd
from keras.preprocessing import image
import keras.applications.resnet50 as resnet50
import keras.applications.xception as xception
import keras.applications.inception_v3 as inception_v3
import matplotlib.pyplot as plt
import seaborn as sns
from PIL import Image

from datetime import datetime
import zipfile
import cv2
from google.colab import files

#Set threshold of recognisability
p_top_thresh = 0.8

Using TensorFlow backend.


In [3]:
#@title
!pip install kaggle

Collecting kaggle
  Downloading https://files.pythonhosted.org/packages/bd/a6/d93a9492ad8f31b1a0d17225acfa066a38a27f5fc2ce9fc5034a7003fff1/kaggle-1.3.6.tar.gz
Building wheels for collected packages: kaggle
  Running setup.py bdist_wheel for kaggle ... [?25l- \ done
[?25h  Stored in directory: /content/.cache/pip/wheels/98/be/57/a576a1f2f50f5c3bebd0c08fc3b2a6881dfde31c8217014978
Successfully built kaggle
Installing collected packages: kaggle
Successfully installed kaggle-1.3.6


In [4]:
from googleapiclient.discovery import build
import io, os
from googleapiclient.http import MediaIoBaseDownload
from google.colab import auth
auth.authenticate_user()
drive_service = build('drive', 'v3')
results = drive_service.files().list(
        q="name = 'kaggle.json'", fields="files(id)").execute()
kaggle_api_key = results.get('files', [])
filename = "/content/.kaggle/kaggle.json"
os.makedirs(os.path.dirname(filename), exist_ok=True)
request = drive_service.files().get_media(fileId=kaggle_api_key[0]['id'])
fh = io.FileIO(filename, 'wb')
downloader = MediaIoBaseDownload(fh, request)
done = False
while done is False:
    status, done = downloader.next_chunk()
    print("Download %d%%." % int(status.progress() * 100))
os.chmod(filename, 600)

Download 100%.


In [5]:
#!kaggle competitions list
!cat /proc/meminfo

MemTotal:       13341832 kB
MemFree:         1077780 kB
MemAvailable:   12641740 kB
Buffers:          131960 kB
Cached:         11077396 kB
SwapCached:            0 kB
Active:          1958068 kB
Inactive:        9602616 kB
Active(anon):     351700 kB
Inactive(anon):      280 kB
Active(file):    1606368 kB
Inactive(file):  9602336 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:              1892 kB
Writeback:             0 kB
AnonPages:        351416 kB
Mapped:           195696 kB
Shmem:               664 kB
Slab:             632984 kB
SReclaimable:     608672 kB
SUnreclaim:        24312 kB
KernelStack:        3072 kB
PageTables:         4496 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     6670916 kB
Committed_AS:    1302956 kB
VmallocTotal:   34359738367 kB
VmallocUsed:           0 kB
VmallocChunk:          0 kB
AnonHugePag

In [5]:
!kaggle competitions files -c avito-demand-prediction

name                    size  creationDate         
---------------------  -----  -------------------  
sample_submission.csv    8MB  2018-04-23 21:26:51  
test_jpg.zip            19GB  2018-04-23 22:09:42  
train_jpg.zip           49GB  2018-04-23 23:06:39  
test.csv.zip           107MB  2018-04-24 18:25:00  
periods_test.csv.zip   136MB  2018-04-24 18:25:02  
periods_train.csv.zip  170MB  2018-04-24 18:25:04  
train.csv.zip          308MB  2018-04-24 18:26:58  
test_active.csv.zip      2GB  2018-04-24 18:31:08  
train_active.csv.zip     3GB  2018-04-24 18:44:13  
train_jpg_4.zip         10GB  2018-05-01 22:40:01  
train_jpg_2.zip         10GB  2018-05-01 22:40:05  
train_jpg_1.zip         10GB  2018-05-01 22:40:06  
train_jpg_0.zip         10GB  2018-05-01 22:40:06  
train_jpg_3.zip         10GB  2018-05-01 22:40:06  


In [5]:
!kaggle competitions download -c avito-demand-prediction -f train_jpg_0.zip   
!kaggle competitions download -c avito-demand-prediction -f train_jpg_1.zip  
!kaggle competitions download -c avito-demand-prediction -f train_jpg_2.zip 
!kaggle competitions download -c avito-demand-prediction -f train_jpg_3.zip 
!kaggle competitions download -c avito-demand-prediction -f train_jpg_4.zip 
#2 mins

train_jpg_0.zip: Downloaded 10GB of 10GB


In [6]:
%%time
!kaggle competitions download -c avito-demand-prediction -f test_jpg.zip 

test_jpg.zip: Downloaded 19GB of 19GB
CPU times: user 30 s, sys: 10 s, total: 40 s
Wall time: 12min 49s


In [7]:
#!kaggle competitions download -c avito-demand-prediction -f train.csv.zip
#!unzip .kaggle/competitions/avito-demand-prediction/train.csv.zip
!kaggle competitions download -c avito-demand-prediction -f test.csv.zip
!unzip .kaggle/competitions/avito-demand-prediction/test.csv.zip
#After unzip, train.csv is under the home folder


test.csv.zip: Downloaded 107MB of 107MB
Archive:  .kaggle/competitions/avito-demand-prediction/test.csv.zip
  inflating: test.csv                


In [66]:
!ls .kaggle/competitions/avito-demand-prediction/

test.csv.zip  test_jpg.zip  train.csv.zip


In [8]:
resnet_model = resnet50.ResNet50(weights='imagenet')
inception_model = inception_v3.InceptionV3(weights='imagenet')
xception_model = xception.Xception(weights='imagenet')

Downloading data from https://github.com/fchollet/deep-learning-models/releases/download/v0.2/resnet50_weights_tf_dim_ordering_tf_kernels.h5
Downloading data from https://github.com/fchollet/deep-learning-models/releases/download/v0.5/inception_v3_weights_tf_dim_ordering_tf_kernels.h5
Downloading data from https://github.com/fchollet/deep-learning-models/releases/download/v0.4/xception_weights_tf_dim_ordering_tf_kernels.h5



**Step 2 - Read Data:**
  * Read train master file into a dataframe 'train_df'
  * Read all the images from the zipped image archive

In [9]:
test_df = pd.read_csv('test.csv')
test_df.head()

Unnamed: 0,item_id,user_id,region,city,parent_category_name,category_name,param_1,param_2,param_3,title,description,price,item_seq_number,activation_date,user_type,image,image_top_1
0,6544e41a8817,dbe73ad6e4b5,Волгоградская область,Волгоград,Личные вещи,Детская одежда и обувь,Для мальчиков,Обувь,25.0,Отдам бесплатно,На ангарском,,66,2017-04-18,Private,a8b57acb5ab304f9c331ac7a074219aed4d349d8aef386...,2020.0
1,65b9484d670f,2e11806abe57,Свердловская область,Нижняя Тура,Хобби и отдых,Велосипеды,Дорожные,,,Продам велосипед,"Продам велосипед KAMA F200,в нормальном состо...",3000.0,4,2017-04-16,Private,,
2,8bab230b2ecd,0b850bbebb10,Новосибирская область,Бердск,Бытовая электроника,Аудио и видео,Телевизоры и проекторы,,,BBK,Продам новый телевизор BBK 32 диагональ смарт...,15000.0,15,2017-04-17,Private,8c361112cb049745ef2d1b0ae73594fc5c107286b0c942...,2960.0
3,8e348601fefc,5f1d5c3ce0da,Саратовская область,Саратов,Для дома и дачи,Бытовая техника,Для кухни,Вытяжки,,Вытяжка Jetair 60,"Продам новую вытяжку в упаковке,с документами....",4500.0,70,2017-04-17,Private,,
4,8bd2fe400b89,23e2d97bfc7f,Оренбургская область,Бузулук,Личные вещи,Товары для детей и игрушки,Детские коляски,,,Коляска зима-лето,Продам отличную коляску. б/у 1 год. все вопрос...,4900.0,15,2017-04-15,Private,bc3cf6deef10840fc302e38eb48fa7748aa1e28d534f8f...,1002.0


In [0]:
#Specify the zip archive
def read_zip():
    print('processing test_jpg')

    start=datetime.now()

    #Create directory to store the images extracted from the zip archive
    images_dir = os.path.expanduser(os.path.join('~', 'avito_images_test'))
    if not os.path.exists(images_dir):
        os.makedirs(images_dir)

    #Extract images and save them under the directory
    with zipfile.ZipFile('.kaggle/competitions/avito-demand-prediction/test_jpg.zip', 'r') as test_zip:
        files_in_zip = sorted(test_zip.namelist())
        for idx, file in enumerate(files_in_zip):
            if file.endswith('.jpg'):
                test_zip.extract(file, path=images_dir)

    #Scan all the images under the directory and save their path to a list that will be used to read the images
    image_files = [x.path for x in os.scandir('/content/avito_images_test/data/competition_files/test_jpg')]
    print('Total number of images read into the image directory: ', len(image_files))

    print (datetime.now()-start)
    
    return image_files


**Step 3 - Make Prediction:**
  * Make predictions on all images and take the top predicted class
  * Set the confidence threshold as 0.8 meaning when ResNet is 80% sure about the top predicted class, we assume the image is recognisable and assign a flag ('flag_clear_img') 1 to the image, otherwise 0.
  * Note: Few damaged images - train_jpg_0.zip has 278168 images but there are 2 damaged ones that can not be opened
  * Potential improvements: Add another two networks; compare similarity between the top 2 categories


In [0]:
#Create a flag to identify whether the top probability is larger than 80%
#Set threshold of the prediction for the most likely class

def extract_img_f(image_files):  
    start=datetime.now()

    #Set the total number of images
    tot_num_image = len(image_files)

    list_img_id = []
    #list_flag_clear_img = []
    list_img_size = []
    list_img_shape_ratio = []

    list_top_prob_resnet = []
    list_top_prob_inception = []
    list_top_prob_xception = []
    list_top_label_resnet = []
    list_top_label_inception = []
    list_top_label_xception = []
    
    for i in range(0, tot_num_image):
        flag_clear_img = 0
        #Skip the damaged images that exist but can not be opened, e.g. image_files[270883] exists but can't be opened
        if cv2.imread(image_files[i]) is not None:
            img = Image.open(image_files[i])
            #Get image size and the shape ratio (width/height)
            image_size = img.size[0]*img.size[1]
            image_shape_ratio = img.size[0]/img.size[1]
            """Classify image and return top matches."""
            target_size = (224, 224)
            if img.size != target_size:
                img = img.resize(target_size)
            x = image.img_to_array(img)
            x = np.expand_dims(x, axis=0)
            x = resnet50.preprocess_input(x)
            
            preds = resnet_model.predict(x)
            resnet_preds = resnet50.decode_predictions(preds, top=3)[0]

            preds = inception_model.predict(x)
            inception_preds = inception_v3.decode_predictions(preds, top=3)[0]

            preds = xception_model.predict(x)
            xception_preds = xception.decode_predictions(preds, top=3)[0]
            
            
            #Top probability
            top_prob_resnet = resnet_preds[0][2]
            top_prob_inception = inception_preds[0][2]
            top_prob_xception = xception_preds[0][2]
            top_label_resnet = resnet_preds[0][1]
            top_label_inception = inception_preds[0][1]
            top_label_xception = xception_preds[0][1]            

            list_top_prob_resnet.append(top_prob_resnet)
            list_top_prob_inception.append(top_prob_inception)
            list_top_prob_xception.append(top_prob_xception)
            list_top_label_resnet.append(top_label_resnet)
            list_top_label_inception.append(top_label_inception)
            list_top_label_xception.append(top_label_xception)            
            
            #Color
            #mean_color = np.mean(dat[1].flatten())
            
            

            list_img_size.append(image_size)
            list_img_shape_ratio.append(image_shape_ratio)
            list_img_id.append(image_files[i].split('/')[-1].replace('.jpg',''))
        #print(resnet_preds)
        
    df = pd.DataFrame({'image_id': list_img_id,
                       'image_size': list_img_size,
                       'image_shape_ratio': list_img_shape_ratio,
                       'top_prob_resnet':list_top_prob_resnet,
                       'top_prob_inception':list_top_prob_inception,
                       'top_prob_xception':list_top_prob_xception,
                       'top_label_resnet':list_top_label_resnet,
                       'top_label_inception':list_top_label_inception,
                       'top_label_xception':list_top_label_xception,                      
                      })
    
    print (datetime.now()-start)
    
    return df


**Step 4 - Compare the deal labelability of the recognisable and unrecognisable group:**
  * The average deal probability on the recognisable is slightly higher than the unrecognisable group. It is more obvious on the 10000 sample (16.3% vs 13.9%) than the 1000 sample (14.7% vs 13.1%).

In [0]:
def report_export(df):
    #Check how different the deal probability on the recognisable (ResNet is more than 80% confident) and unrecognisable images (ResNet is <= 80% confident)
    df_w_target = pd.merge(df, test_df[['image','item_id']], how = 'left', left_on='image_id', right_on='image')
    print('check missing rate from merge:', df_w_target.item_id.isnull().sum())
    #sns.countplot(df_w_target['flag_clear_img'])
    
    df_w_target.drop(['item_id'], axis=1).to_csv('test_jpg.csv')
    files.download('test_jpg.csv')


In [0]:
# Call the functions
image_files = read_zip()
df = extract_img_f(image_files = image_files)
report_export(df = df)

processing test_jpg


OSError: ignored

In [35]:
df

Unnamed: 0,image_id,image_shape_ratio,image_size,top_label_inception,top_label_resnet,top_label_xception,top_prob_inception,top_prob_resnet,top_prob_xception
0,000e90e5272bbba9399f10f4c76c4c97b1be5e7f6935ca...,1.333333,172800,bow,shoe_shop,ping-pong_ball,0.938033,0.469369,0.962981
1,0003cda6edcbe2e2cfdc700380488e40da6ce9dd09e762...,1.508333,195480,stopwatch,sandal,bib,0.981496,0.180341,0.586537
2,0007489d6d82820e7e24e823d6f8b06d570454dfc8b7e4...,1.777778,230400,stopwatch,barrel,ping-pong_ball,0.644471,0.934163,0.532161
3,00052afea2d037b071fc92966602bc36681b50b63674fe...,0.750000,172800,web_site,Loafer,ping-pong_ball,0.999996,0.548104,0.893680
4,00003f4d0f91fa03e947b568ef83e03295ac735f47a082...,0.750000,172800,web_site,modem,nipple,1.000000,0.584703,0.999102
5,000e96167f3e2cf2d228252fa44aaa0aec133142527b68...,0.754717,171720,flatworm,hoopskirt,ping-pong_ball,0.996703,0.403403,0.352036
6,000320d07fb988932c1acbf31ea9289619fbb4bff0f01f...,1.333333,172800,web_site,sweatshirt,ping-pong_ball,0.999840,0.986492,0.796083
7,0007e3d331f3f66d481354d07c388cb64149af3abca9fb...,1.777778,230400,spotlight,tricycle,ping-pong_ball,0.863255,0.381521,0.997718
8,00074ccb45507f859fa323ba7a796b66477251d8967156...,0.750000,172800,stopwatch,clog,ping-pong_ball,0.920725,0.587212,0.998837
9,00005eb46a94b7fbeaaec8f571bf6c09117329de5bf744...,1.000000,129600,ashcan,steam_locomotive,Dutch_oven,0.970438,0.197454,0.570073


In [0]:
# Install the PyDrive wrapper & import libraries.
# This only needs to be done once in a notebook.
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client.
# This only needs to be done once in a notebook.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)



In [26]:
# Create & upload a file.
uploaded = drive.CreateFile({'title': 'test_jpg.csv'})
uploaded.SetContentFile('test_jpg.csv')
uploaded.Upload()
print('Uploaded file with ID {}'.format(uploaded.get('id')))


Uploaded file with ID 1P5PHzBW6Uf9IOoAFqb8aAKe_DfdiqM7X


In [1]:
!ls

datalab
