<a href="https://colab.research.google.com/github/axel-sirota/practical-nlp/blob/main/4-ner/Practical_NLP_11_NER_spacy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!python -m spacy download en_core_web_trf

2022-10-16 17:10:38.769052: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-trf==3.4.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.4.0/en_core_web_trf-3.4.0-py3-none-any.whl (460.3 MB)
[K     |████████████████████████████████| 460.3 MB 35 kB/s 
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_trf')


In [2]:
!pip install --upgrade spacy-transformers 'spacy==3.4' spacy[transformers]

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting spacy==3.4
  Downloading spacy-3.4.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.3 MB)
[K     |████████████████████████████████| 6.3 MB 5.0 MB/s 
Installing collected packages: spacy
  Attempting uninstall: spacy
    Found existing installation: spacy 3.4.1
    Uninstalling spacy-3.4.1:
      Successfully uninstalled spacy-3.4.1
Successfully installed spacy-3.4.0


In [3]:
import numpy as np
import spacy
import sys
import pandas as pd
np.random.seed(42)


In [4]:
%%writefile get_data.sh
if [ ! -f new_york_reduced.csv ]; then
  wget -O new_york_reduced.csv https://www.dropbox.com/s/fcagfdzahya1ttn/new_york_reduced.csv?dl=0
fi


Overwriting get_data.sh


In [5]:
!bash get_data.sh

In [6]:
nlp = spacy.load("en_core_web_trf")
dataset = pd.read_csv('new_york_reduced.csv')[:100][["id", "name", "description", "neighbourhood_cleansed", "property_type"]]
dataset.head()

Unnamed: 0,id,name,description,neighbourhood_cleansed,property_type
0,2595,Skylit Midtown Castle,"Beautiful, spacious skylit studio in the heart...",Midtown,Entire apartment
1,3831,"Whole flr w/private bdrm, bath & kitchen(pls r...","Enjoy 500 s.f. top floor in 1899 brownstone, w...",Bedford-Stuyvesant,Entire guest suite
2,5121,BlissArtsSpace!,<b>The space</b><br />HELLO EVERYONE AND THANK...,Bedford-Stuyvesant,Private room in apartment
3,5136,"Spacious Brooklyn Duplex, Patio + Garden",We welcome you to stay in our lovely 2 br dupl...,Sunset Park,Entire apartment
4,5178,Large Furnished Room Near B'way,Please don’t expect the luxury here just a bas...,Midtown,Private room in apartment


In [7]:
first_description = dataset["description"].iloc[0]
first_description



'Beautiful, spacious skylit studio in the heart of Midtown, Manhattan. <br /><br />STUNNING SKYLIT STUDIO / 1 BED + SINGLE / FULL BATH / FULL KITCHEN / FIREPLACE / CENTRALLY LOCATED / WiFi + APPLE TV / SHEETS + TOWELS<br /><br /><b>The space</b><br />- Spacious (500+ft²), immaculate and nicely furnished & designed studio.<br />- Tuck yourself into the ultra comfortable bed under the skylight. Fall in love with a myriad of bright lights in the city night sky. <br />- Single-sized bed/convertible floor mattress with luxury bedding (available upon request).<br />- Gorgeous pyramid skylight with amazing diffused natural light, stunning architectural details, soaring high vaulted ceilings, exposed brick, wood burning fireplace, floor seating area with natural zafu cushions, modern style mixed with eclectic art & antique treasures, large full bath, newly renovated kitchen, air conditioning/heat, high speed WiFi Internet, and Apple TV.<br />- Centrally located in the heart of Midtown Manhatta

In [8]:
doc = nlp(first_description)

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)



1 107 108 CARDINAL
Apple TV.<br 932 944 ORG
Midtown Manhattan 983 1000 LOC


In [9]:
tags = []

def update_caches(document):
    doc = nlp(document)
    inner_tags = []
    for ent in doc.ents:
        inner_tags.append(ent.text)
    tags_to_append = inner_tags if len(inner_tags) else None
    tags.append(tags_to_append)

update_caches = np.vectorize(update_caches)

In [10]:
_ = update_caches(dataset[["description"]].values)

In [11]:
dataset = pd.concat([dataset, pd.Series(tags, name="tags")], axis=1)
dataset["id"] = pd.to_numeric(dataset["id"], downcast='integer')
dataset = dataset[:100]

In [12]:
dataset

Unnamed: 0,id,name,description,neighbourhood_cleansed,property_type,tags
0,2595.0,Skylit Midtown Castle,"Beautiful, spacious skylit studio in the heart...",Midtown,Entire apartment,"[1, Apple TV.<br, Midtown Manhattan]"
1,3831.0,"Whole flr w/private bdrm, bath & kitchen(pls r...","Enjoy 500 s.f. top floor in 1899 brownstone, w...",Bedford-Stuyvesant,Entire guest suite,"[1, Apple TV.<br, Midtown Manhattan]"
2,5121.0,BlissArtsSpace!,<b>The space</b><br />HELLO EVERYONE AND THANK...,Bedford-Stuyvesant,Private room in apartment,"[500, AirBnbs, Airbnb, 7, 5, minutes, Brooklyn]"
3,5136.0,"Spacious Brooklyn Duplex, Patio + Garden",We welcome you to stay in our lovely 2 br dupl...,Sunset Park,Entire apartment,"[the last year few years, U.K., Germany, Italy..."
4,5178.0,Large Furnished Room Near B'way,Please don’t expect the luxury here just a bas...,Midtown,Private room in apartment,"[2, South Slope,, Brooklyn, few months, 4, 2, ..."
...,...,...,...,...,...,...
95,38663.0,Luxury Brownstone in Boerum Hill,"Beautiful, large home in great hipster neighbo...",Boerum Hill,Entire house,"[30 days, 1, 10th Street, the Christopher Stre..."
96,39282.0,“Work-from-home” from OUR home.,*Monthly Discount Available*<br />Your home-of...,Williamsburg,Private room in apartment,"[Boerum Hill, one, Atlantic/Pacific streets, m..."
97,39572.0,1 br in a 2 br apt (Midtown West),<b>The space</b><br />1 bedroom in a 2 bedroom...,Hell's Kitchen,Private room in apartment,"[Monthly, 3, three, Brooklyn, Holly, Russel, 1..."
98,39586.0,Big room w/private bathroom and great Hudson view,You are renting a big room with private bathro...,Upper West Side,Private room in apartment,"[2, Midtown West, Times Square, Xmas, New Year..."
