# 07 - Model Deployment

by [Alejandro Correa Bahnsen](albahnsen.com/)

version 0.4, Feb 2017

## Part of the Tutorial [Practical Machine Learning](https://github.com/albahnsen/Tutorial_PracticalMachineLearning_Pycon)


This notebook is licensed under a [Creative Commons Attribution-ShareAlike 3.0 Unported License](https://creativecommons.org/licenses/by-sa/3.0/)

## Agenda:

1. Creating and saving the model
2. Running the model on a sample
3. Exposing the model as an API

## Part 1: Phishing Detection

Phishing is, by definition, the act of deceiving a user into handing over personal information by impersonating a trusted institution or entity. Users often find it difficult to tell legitimate sites from malicious ones, because the latter are built to look identical to the former. There is therefore a need for better tools to combat the perpetrators of this fraud.

In [None]:
import pandas as pd
import zipfile
with zipfile.ZipFile('data/phishing.csv.zip', 'r') as z:
    f = z.open('phishing.csv')
    data = pd.read_csv(f, index_col=False)

In [None]:
data.head()

Unnamed: 0,url,phishing
0,http://www.subalipack.com/contact/images/sampl...,1
1,http://fasc.maximecapellot-gypsyjazz-ensemble....,1
2,http://theotheragency.com/confirmer/confirmer-...,1
3,http://aaalandscaping.com/components/com_smart...,1
4,http://paypal.com.confirm-key-21107316126168.s...,1


In [None]:
data.tail()

Unnamed: 0,url,phishing
39995,http://www.diaperswappers.com/forum/member.php...,0
39996,http://posting.bohemian.com/northbay/Tools/Ema...,0
39997,http://www.tripadvisor.jp/Hotel_Review-g303832...,0
39998,http://www.baylor.edu/content/services/downloa...,0
39999,http://www.phinfever.com/forums/viewtopic.php?...,0


In [None]:
data.phishing.value_counts()

1    20000
0    20000
Name: phishing, dtype: int64

### Generating features

In [None]:
data.url[data.phishing==1].sample(50, random_state=1).tolist()

['http://dothan.com.co/gold/austspark/index.htm\n',
 'http://78.142.63.63/%7Enetsysco/process/fc1d9c7ea4773b7ff90925c2902cb5f2\n',
 'http://verify95.5gbfree.com/coverme2010/\n',
 'http://www.racom.com/uploads/productscat/bookmark/ii.php?.rand=13vqcr8bp0gud&cbcxt=mai&email=abuse@tradinghouse.ca\n',
 'http://www.cleanenergytci.com/components/update.logon.l3an7lofamerica/2342343234532534546347677898765432876543345687656543876/\n',
 'http://209.148.89.163/-/santander.co.uk/weblegn/AccountLogin.php\n',
 'http://senevi.com/confirmation/\n',
 'http://www.hellenkeller.cl/tmp/new/noticias/Modulo_de_Atualizacao_Bradesco/index2.php?id=PSO1AM04L3Q6PSBNVJ82QUCO0L5GBSY2KM2U9BYUEO14HCRDVZEMTRB3DGJO9HPT4ROC4M8HA8LRJD5FCJ27AD0NTSC3A3VDUJQX6XFG519OED4RW6Y8J8VC19EAAAO5UF21CHGHIP7W4AO1GM8ZU4BUBQ6L2UQVARVM\n',
 'http://internet-sicherheit.co/de/konflikt/src%3Dde/AZ00276ZZ75/we%3Dhs_0_2/sicherheit/konto_verifizieren/verifizierung.php\n',
 'http://alen.co/docs/cleaner\n',
 'http://rattanhouse.co/Atualizacao_

Phishing URLs such as the ones above often contain some of the following features:
* https
* login
* .php
* .html
* @
* sign
* ?

In [None]:
keywords = ['https', 'login', '.php', '.html', '@', 'sign']

In [None]:
for keyword in keywords:
    data['keyword_' + keyword] = data.url.str.contains(keyword).astype(int)
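
One detail worth checking here: `str.contains` treats its pattern as a regular expression by default, so the dot in `.php` or `.html` matches any character, not only a literal dot. The sketch below uses the standard `re` module to show the difference between regex and literal matching; the two sample URLs are made up for illustration and are not taken from the dataset.

```python
import re

keyword = '.php'
urls = [
    'http://example.com/index.php',   # contains the literal text ".php"
    'http://example.com/myphp/',      # contains "php" but no dot before it
]

# Regex matching: "." stands for any single character.
regex_hits = [int(re.search(keyword, u) is not None) for u in urls]

# Literal matching: the dot must really be a dot.
literal_hits = [int(keyword in u) for u in urls]

print(regex_hits)    # [1, 1]
print(literal_hits)  # [1, 0]
```

In pandas, the literal behaviour corresponds to `data.url.str.contains(keyword, regex=False)`. Switching would change some feature values, so the model would have to be retrained with the same setting.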

We also extract some metadata about these URLs:
* URL length
* Domain length
* Is the domain an IP address?
* Number of occurrences of "com"

In [None]:
data['lenght'] = data.url.str.len() - 2

In [None]:
domain = data.url.str.split('/', expand=True).iloc[:, 2]

In [None]:
data['lenght_domain'] = domain.str.len()

In [None]:
domain.head(12)

0                                    www.subalipack.com
1             fasc.maximecapellot-gypsyjazz-ensemble.nl
2                                    theotheragency.com
3                                    aaalandscaping.com
4     paypal.com.confirm-key-21107316126168.securepp...
5                              lcthomasdeiriarte.edu.co
6                                       livetoshare.org
7                                            www.i-m.co
8                                     manuelfernando.co
9                                www.bladesmithnews.com
10                                      www.rasbaek.com
11                                      199.231.190.160
Name: 2, dtype: object

In [None]:
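# Remove the dots literally (regex=False); with regex matching, '.' would
# match every character and no domain would ever be flagged as an IP.
data['isIP'] = domain.str.replace('.', '', regex=False).str.isnumeric().astype(int)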

In [None]:
data['count_com'] = data.url.str.count('com')
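
Splitting on `/` works for the URLs in this dataset, but the third piece would also include a port or credentials if a URL carried them. As an alternative sketch (not what this notebook uses), the standard library can parse the host and check whether it is an IP address. `host_features` is a hypothetical helper name, and the paths in the sample URLs are illustrative.

```python
import ipaddress
from urllib.parse import urlsplit


def host_features(url):
    """Return (host, host_length, is_ip) for a URL string."""
    # .hostname drops any port or user info; strip() removes a trailing newline.
    host = urlsplit(url.strip()).hostname or ''
    try:
        ipaddress.ip_address(host)
        is_ip = 1
    except ValueError:
        # Not a valid IPv4/IPv6 literal (this includes the empty string).
        is_ip = 0
    return host, len(host), is_ip


print(host_features('http://199.231.190.160/'))
print(host_features('http://www.subalipack.com/contact/'))
```

Features computed this way can differ from the `split('/')` version for unusual URLs, so the training and serving code would need to use the same variant.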

In [None]:
data.sample(15, random_state=4)

Unnamed: 0,url,phishing,keyword_https,keyword_login,keyword_.php,keyword_.html,keyword_@,keyword_sign,lenght,lenght_domain,isIP,count_com
28607,http://pennstatehershey.org/web/ibd/home/event...,0,0,0,0,0,0,0,80,20,0,0
3689,http://guiadesanborja.com/multiprinter/muestra...,1,0,1,1,0,0,0,81,18,0,1
6405,http://paranaibaweb.com/faleconosco/accounting...,1,0,0,0,1,0,0,65,16,0,1
35355,http://courts.delaware.gov/Jury%20Services/Hel...,0,0,0,0,0,0,0,94,19,0,0
16520,http://erpa.co/tmp/getproductrequest.htm\n,1,0,0,0,0,0,0,39,7,0,0
16196,http://pulapulapipoca.com/components/com_media...,1,0,1,1,0,0,0,239,18,0,4
3810,http://www.dag.or.kr/zboard/icon/visa/img/Atua...,1,0,0,0,0,0,0,62,13,0,0
3005,http://www.amazingdressup.com/wp-content/theme...,1,0,0,0,1,0,0,94,22,0,1
9003,http://web.indosuksesfutures.com/content_file/...,1,0,0,0,0,0,0,80,25,0,1
34704,http://www.nutritionaltree.com/subcat.aspx?cid...,0,0,0,0,0,0,0,69,23,0,1


### Creating the model

In [None]:
X = data.drop(['url', 'phishing'], axis=1)

In [None]:
y = data.phishing

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

In [None]:
clf = RandomForestClassifier(n_jobs=-1, n_estimators=100)

In [None]:
cross_val_score(clf, X, y, cv=10)

array([0.806  , 0.81425, 0.80725, 0.7915 , 0.8035 , 0.81   , 0.80125,
       0.805  , 0.8035 , 0.792  ])
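
To summarize the ten folds in a single figure, the mean and standard deviation of the scores printed above can be computed directly. This is a small sketch using only the standard library, with the values copied from the output; for a classifier, `cross_val_score` reports accuracy by default.

```python
from statistics import mean, stdev

# Fold scores copied from the cross_val_score output above.
scores = [0.806, 0.81425, 0.80725, 0.7915, 0.8035,
          0.81, 0.80125, 0.805, 0.8035, 0.792]

print(f'accuracy: {mean(scores):.4f} +/- {stdev(scores):.4f}')
```

The mean comes out at roughly 0.80, with little variation between folds.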

In [None]:
clf.fit(X, y)

RandomForestClassifier(n_jobs=-1)

### Saving the model

In [None]:
import joblib

In [None]:
joblib.dump(clf, 'data/07_phishing_clf.pkl', compress=3)

['data/07_phishing_clf.pkl']

## Part 2: Testing the model with a sample

See the file `m07_model_deployment.py`.

In [None]:
from m07_model_deployment import predict_proba

In [None]:
predict_proba('http://www.vipturismolondres.com/com.br/?atendimento=Cliente&/LgSgkszm64/B8aNzHa8Aj.php')

0.6164826839826839
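
The contents of `m07_model_deployment.py` are not reproduced in this notebook. As a rough sketch of the feature step such a function needs, assuming it mirrors the columns built in the feature cells above, the pure-Python helper below rebuilds the same features for a single URL. `url_features` is a hypothetical name, not the module's actual API. The key point is that the feature names and their order must match the training data.

```python
import re

KEYWORDS = ['https', 'login', '.php', '.html', '@', 'sign']


def url_features(url):
    """Build the feature dict for one URL, in training column order."""
    features = {}
    for keyword in KEYWORDS:
        # Regex search, to mirror the default behaviour of str.contains.
        features['keyword_' + keyword] = int(re.search(keyword, url) is not None)

    # Third piece of the split is the domain, as in the notebook.
    parts = url.split('/')
    domain = parts[2] if len(parts) > 2 else ''

    features['lenght'] = len(url) - 2  # same offset as the notebook
    features['lenght_domain'] = len(domain)
    # Assumption: dots are removed literally before the numeric check.
    features['isIP'] = int(domain.replace('.', '').isnumeric())
    features['count_com'] = url.count('com')
    return features


sample = 'http://www.vipturismolondres.com/com.br/?atendimento=Cliente&/LgSgkszm64/B8aNzHa8Aj.php'
print(url_features(sample))
```

The resulting dict can be wrapped in a one-row DataFrame, for example `pd.DataFrame([url_features(url)])`, before it is passed to `clf.predict_proba`.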

## Part 3: API

Flask is often considered more Pythonic than Django because the code for a Flask web application tends to be more explicit. It is also easy to get started with as a beginner, since only a small amount of boilerplate code is needed to get a simple app up and running.

### First, we need to install some packages
[Flask REST Plus](https://flask-restplus.readthedocs.io/en/stable/marshalling.html#basic-usage)

```
pip install flask-restplus
```

In [None]:
!pip install flask-restplus

Collecting flask-restplus
  Downloading flask_restplus-0.13.0-py2.py3-none-any.whl (2.5 MB)
Collecting aniso8601>=0.82
  Downloading aniso8601-9.0.1-py2.py3-none-any.whl (52 kB)
Installing collected packages: aniso8601, flask-restplus
Successfully installed aniso8601-9.0.1 flask-restplus-0.13.0


### Let's load Flask

In [None]:
import werkzeug
from werkzeug.utils import cached_property

import flask

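# flask-restplus 0.13.0 expects `werkzeug.cached_property`, which newer
# Werkzeug releases no longer expose at the top level; restore it before
# importing flask_restplus.
werkzeug.cached_property = cached_property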

from flask import Flask
from flask_restplus import Api
from flask_restplus import fields
from flask_restplus import Resource
import joblib
import pandas as pd

### Creating the API

In [None]:
app = Flask(__name__)

api = Api(
    app, 
    version='1.0', 
    title='Phishing Prediction API',
    description='Phishing Prediction API')

ns = api.namespace('predict', 
     description='Phishing Classifier')
   
parser = api.parser()

parser.add_argument(
    'URL', 
    type=str, 
    required=True, 
    help='URL to be analyzed', 
    location='args')

resource_fields = api.model('Resource', {
    'result': fields.String,
})

### Loading the model and defining the prediction function

In [None]:
clf = joblib.load('data/07_phishing_clf.pkl') 

@ns.route('/')
class PhishingApi(Resource):

    @api.doc(parser=parser)
    @api.marshal_with(resource_fields)
    def get(self):
        args = parser.parse_args()
        result = self.predict_proba(args)

        return result, 200

    def predict_proba(self, args):
        url = args['URL']
        
        url_ = pd.DataFrame([url], columns=['url'])
        
        # Create features
        keywords = ['https', 'login', '.php', '.html', '@', 'sign']
        for keyword in keywords:
            url_['keyword_' + keyword] = url_.url.str.contains(keyword).astype(int)
        
        url_['lenght'] = url_.url.str.len() - 2
        domain = url_.url.str.split('/', expand=True).iloc[:, 2]
        url_['lenght_domain'] = domain.str.len()
        # Check the domain (not the full URL), consistent with the training features
        url_['isIP'] = domain.str.replace('.', '', regex=False).str.isnumeric().astype(int)
        url_['count_com'] = url_.url.str.count('com')

        # Make prediction
        p1 = clf.predict_proba(url_.drop('url', axis=1))[0,1]

        print('url=', url,'| p1=', p1)

        return {
         "result": p1
        }

### Running the API

In [None]:
app.run(debug=True, use_reloader=False, host='0.0.0.0', port=5000)

 * Serving Flask app "__main__" (lazy loading)
 * Environment: production
   Use a production WSGI server instead.
 * Debug mode: on


 * Running on http://0.0.0.0:5000/ (Press CTRL+C to quit)
127.0.0.1 - - [02/Nov/2021 10:22:14] "GET /predict/?URL=http://consultoriojuridico.co/pp/www.paypal.com/ HTTP/1.1" 200 -


url= http://consultoriojuridico.co/pp/www.paypal.com/ | p1= 0.3004623823888529


127.0.0.1 - - [02/Nov/2021 10:22:15] "GET /favicon.ico HTTP/1.1" 404 -


### Let's verify that it works

* http://localhost:5000/predict/?URL=http://consultoriojuridico.co/pp/www.paypal.com/
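
The same check can be done from Python. Because the target URL is itself passed as a query parameter, it is safer to percent-encode it; otherwise characters such as `?` and `&` inside the target would be read as part of the API's own query string. The sketch below uses only the standard library and assumes the API is running locally on port 5000 as above. `build_request_url` and `query_api` are hypothetical helper names.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_ENDPOINT = 'http://localhost:5000/predict/'


def build_request_url(target_url, endpoint=API_ENDPOINT):
    """Return the GET URL with the target percent-encoded in the URL parameter."""
    return endpoint + '?' + urlencode({'URL': target_url})


def query_api(target_url):
    """Call the running API and return the decoded JSON response."""
    with urlopen(build_request_url(target_url)) as response:
        return json.loads(response.read().decode('utf-8'))


print(build_request_url('http://consultoriojuridico.co/pp/www.paypal.com/'))
# query_api('http://consultoriojuridico.co/pp/www.paypal.com/')  # requires the server to be running
```

The server side needs no change, since the query string is decoded before `parser.parse_args()` returns the `URL` value.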
