In [1]:
import numpy as np
import pandas as pd
ipi install spacy
pip install nltk

# Importing Scraped Datasets

The directory path in the code below is unique to my laptop, so when you download the raw datasets, you should modify to adjust to the directory that the dataset is located in your computer.

In [3]:
roadsonline=pd.read_csv('/Users/./Desktop/AUS_DATASETS/rawdatasets/roadsonline.csv')
projectory=pd.read_csv('/Users/./Desktop/AUS_DATASETS/rawdatasets/projectory.csv')
reneweconomy=pd.read_csv('/Users/./Desktop/AUS_DATASETS/rawdatasets/reneweconomy.csv')
utilitymag=pd.read_csv('/Users/./Desktop/AUS_DATASETS/rawdatasets/utilitymag.csv')

# Cleaning Datasets

In [4]:
# leaving the columns we are interested in
roadsonline=roadsonline[['wrap','date article', 'author', 'category', 'content']]
projectory=projectory[['summary', 'contentsummary', 'details','link',
                       'content', 'projname', 'projstatus', 'projsite',
                       'startdate', 'contactperson', 'contactorg', 'overallcontacts',
                       'projdetailsoverall']]
reneweconomy=reneweconomy[['link', 'summary', 'author', 'date','content']]
utilitymag=utilitymag[['wrap', 'author', 'date', 'category', 'summary', 'content']]

### Roadsonline

In [5]:
# cleaning cetegory column from unnecessary symbols
roadsonline['category']=roadsonline['category'].str.replace('\n',' ')
roadsonline['category']=roadsonline['category'].str.replace('\t',' ')
# cleaning content column from unnecessary symbols
roadsonline['content']=roadsonline['content'].str.replace('\n',' ')
roadsonline['category']=roadsonline['category'].str.replace('\t',' ')
roadsonline['content']=roadsonline['content'].str.replace('\xa0',' ')
# Removing story suggestions, social media links at the end of article
def spliting(text):
    return text.split('Related stories')[0] ## you can use this function to split the article at any stop word
## in this case the stop word was "Related Stories", just duplicate this function but insert the stop word you want
i=0
list_text=[]
while i<4591:
    list_text.append(spliting(str(roadsonline['content'][i])))
    i+=1
roadsonline['content']=list_text
# Removing article suggestions, social media links at the end of article
def spliting_new(text):
    return text.split('Related articles')[0] ## in this case the stop word was "Related articles"
i=0
list_text_new=[]
while i<4591:
    list_text_new.append(spliting_new(str(roadsonline['content'][i])))
    i+=1
roadsonline['content']=list_text_new

### Projectory

In [6]:
# leaving the columns we are interested in
projectory=projectory[['summary','contentsummary','details', 'link','content',
                       'projdetailsoverall','overallcontacts']]
# cleaning content column from unnecessary symbols
projectory['content']=projectory['content'].str.replace('\n',' ')
projectory['content']=projectory['content'].str.replace('\t',' ')
projectory['content']=projectory['content'].str.replace('\xa0',' ')

# cleaning other columns from unnecessary symbols

projectory['summary']=projectory['summary'].str.replace('\n',' ')
projectory['summary']=projectory['summary'].str.replace('\t',' ')
projectory['summary']=projectory['summary'].str.replace('\xa0',' ')

projectory['contentsummary']=projectory['contentsummary'].str.replace('\n',' ')
projectory['contentsummary']=projectory['contentsummary'].str.replace('\t',' ')
projectory['contentsummary']=projectory['contentsummary'].str.replace('\xa0',' ')

projectory['details']=projectory['details'].str.replace('\n',' ')
projectory['details']=projectory['details'].str.replace('\t',' ')
projectory['details']=projectory['details'].str.replace('\xa0',' ')

projectory['link']=projectory['link'].str.replace('\n',' ')
projectory['link']=projectory['link'].str.replace('\t',' ')
projectory['link']=projectory['link'].str.replace('\xa0',' ')

projectory['projdetailsoverall']=projectory['projdetailsoverall'].str.replace('\n',' ')
projectory['projdetailsoverall']=projectory['projdetailsoverall'].str.replace('\t',' ')
projectory['projdetailsoverall']=projectory['projdetailsoverall'].str.replace('\xa0',' ')

projectory['overallcontacts']=projectory['overallcontacts'].str.replace('\n',' ')
projectory['overallcontacts']=projectory['overallcontacts'].str.replace('\t',' ')
projectory['overallcontacts']=projectory['overallcontacts'].str.replace('\xa0',' ')

Reneweconomy and Utilitymag are more general newspaper websites, even though they contain useful information regarding construction projects. You can apply the same steps as above to clean these datasets after preliminary human analysis of how the content and other columsn look like and identifying possible issues, such as unnecessary symbols, image descriptions such as in Reneweconomy. These can be resolved by using the splitting functions and string replace commands shown above.

# Training the First Model (Builtin SPACY labels + our added labels)

We will be first showing results on roadsonlne dataset

## Specification 1

The first model has 2 specifications. In specification 1, we will train the model separately on each label we add. For example, we will add Project label, to help spacy identify the project name from text and we will run it to get results. Then we will separately add Cost label to the initial spacy tool to help it find the cost.

In [7]:
import spacy
nlp=spacy.load("en_core_web_sm") 

# Getting the ner component
ner=nlp.get_pipe('ner')



#### Adding Label "Project"

In [8]:
LABEL = "PROJECT"

TRAIN_DATA=[

    ('The cost of building Melbourne’s new West Gate Tunnel has soared to at least $10 billion as a stoush between Transurban, two builders and the Victorian government over who should stump up billions of dollars to finish the troubled project heats up.', {"entities": [(37, 53, "PROJECT")]}),
    ('Major construction on the four-kilometre Warringah Freeway upgrades will begin in late 2021, while some work will begin as early as March.',{"entities": [(41, 67, "PROJECT")]}),
    ('The new Western Sydney Airport rail link to St Marys station has been given planning approval by the NSW government.',{"entities": [(8, 40, "PROJECT")]}),
    ('The Palaszczuk Government has scrapped plans for a $250 million underground busway at Roma St as part of its Cross River Rail project.',{"entities": [(109, 125, "PROJECT")]}),
    ('The $630 million Albion Park Rail Bypass has reached a new milestone, with the northbound lanes between Yallah and Oak Flats opened to traffic from Saturday.',{"entities": [(17, 40, "PROJECT")]}),
    ('Feedback submissions from Queenslanders are encouraged on the $1 billion Winchester South coal mine located 30 km southeast of Moranbah in the Bowen Basin.',{"entities": [(73, 99, "PROJECT")]}),
    ('The Australian Rail Track Corporation (ARTC) has awarded John Holland the design and construction contract for the Botany Rail Duplication Project.',{"entities": [(115, 146, "PROJECT")]}),
    ('Tenders ACT has issued a Notice of Upcoming Tender for the design and construction of the John Gorton Drive and Molonglo River Bridge Crossing project.',{"entities": [(90, 150, "PROJECT")]}),
    ('The executive director of the North-South Corridor project, Susana Fueyo, will brief local suppliers and subcontractors on the $9.9bn next stage of development.',{"entities": [(30, 58, "PROJECT")]}),
    ('Energy Developments Limited has secured the contract to build and operate the Jabiru hybrid power station.',{"entities": [(78, 105, "PROJECT")]}),
    ('Coleman Rail in partnership with KBR has been selected to deliver Stage Two of the Shepparton Line Upgrade.',{"entities": [(83, 106, "PROJECT")]}),
    ('According to Rail Projects Victoria, the Shepparton Line Upgrade is set to create 600 jobs over the three stages.',{"entities": [(41, 64, "PROJECT")]}),
    ('ACCIONA has secured approval from the Queensland Government for its Material Change of Use for the Aldoga Solar Farm development 20 km north west Gladstone, Queensland.',{"entities": [(99, 116, "PROJECT")]}),
    ('The proponents Bowen Pipeline Company(BPC)are pushing for co-ordinated project status and aiming for a mid-2023 start to construction for the Bowen Pipeline Project.',{"entities": [(142, 164, "PROJECT")]}),
    ('Publicly-owned generator Stanwell Corporation is partnering with the Japanese Iwatani Corporation to develop the three-gigawatt Aldoga renewable hydrogen facility (ARHP) in Aldoga, west of Gladstone.',{"entities": [(128, 162, "PROJECT")]}),
    ('Rio Tinto has awarded NRW a contract for the delivery of a 34MW Solar PV System at the Gudai Darri mine.',{"entities": [(87, 103, "PROJECT")]}),
    ('The Pacific Industrial Company recently secured the steel fabrication package for the Public Transport Authority (PTA) of Western Australia’s METRONET Airport Central Station Project at Perth Airport.',{"entities": [(142, 182, "PROJECT")]}),
    ('The Forrestfield-Airport Link is scheduled to open in the second half of 2021, and provide a 20 minute direct link between the eastern foothills, central business district, wider public transport network and airport.',{"entities": [(4, 29, "PROJECT")]}),
    ('Proponent SE Waroona Development Pty Ltd - part of Victorian-based company South Energy - is behind the proposed Waroona Solar Farm 11km south-west of Waroona and 30 km south of Pinjarra in Western Australia and spanning 300 hectares.',{"entities": [(113, 131, "PROJECT")]}),
    ('Transport for NSW (TfNSW) called for a team of design and construction Non-Owner Participants to form a Project Alliance to procure and deliver the design and construction of the Heathcote Bridge Widening project.',{"entities": [(179, 204, "PROJECT")]}),
    ('Magnetite Mines has released a pre-feasibility study (PFS) for its flagship Razorback project in South Australia.',{"entities": [(76, 93, "PROJECT")]}),
    ('The Federal Government is investing over $10 billion in a safer, smoother and more reliable Bruce Highway.',{"entities": [(92, 105, "PROJECT")]}),
    ('Weather permitting, the Genex Kidston Connection Project is due to be completed in early 2024.',{"entities": [(24, 56, "PROJECT")]}),
    ('Major construction is scheduled to commence soon on the $176.2 million Bendigo and Echuca Line Upgrade in Victoria following the appointment of John Holland to deliver the track upgrades.',{"entities": [(71, 102, "PROJECT")]}),
    ('The new Goornong Station is projected to be completed by the end of 2021',{"entities": [(8, 24, "PROJECT")]}),
    ('while the new Huntly Station to be completed in mid-2022.',{"entities": [(14, 28, "PROJECT")]}),
    ('The Raywood Station on the Swan Hill Line is due to be completed at the end of 2022. ',{"entities": [(4, 19, "PROJECT")]}),
    ('Western Australia’s $230 million Swan River Crossings project in Fremantle is a step closer to construction following the unveiling of the new alignment that was the result of an extensive community consultation.',{"entities": [(33, 61, "PROJECT")]}),
    ('The Australian Rail Track Corporation (ARTC) has marked a new milestone on the $400 million Port Botany Rail Line Duplication and Cabramatta Loop Project with the awarding of two major contracts.',{"entities": [(92, 153, "PROJECT")]}),
    ('Another METRONET project is moving forward as construction commences on the new Lakelands Station, which is located between the Warnbro and Mandurah stations in Western Australia.',{"entities": [(80, 97, "PROJECT")]}),
    ('The approximately $3.5 billion Iron Bridge Magnetite Project is located 145 kilometres south of Port Hedland in the Pilbara region.',{"entities": [(31, 60, "PROJECT")]}),
    ('Weather permitting, works on the car park upgrade is due to be completed by late 2022, with the entire Ferny Grove project completed by late 2023.',{"entities": [(103, 114, "PROJECT")]}),
    ('Feedback from the community is going to be considered as part of the detailed design for the $327.5 million Stage 1 upgrade on the Mooloolah River Interchange.',{"entities": [(131, 158, "PROJECT")]}),
    ('The State Government is investing $84 million to deliver upgrades to the East and West Tamar Highways',{"entities": [(73, 101, "PROJECT")]})
]

In [9]:
# Add the new label to ner
ner.add_label(LABEL)

# Resume training
optimizer = nlp.resume_training()
move_names = list(ner.move_names)

# List of pipes you want to train
pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]

# List of pipes which should remain unaffected in training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]

In [10]:
from spacy.training.example import Example
from spacy.util import minibatch, compounding
import random

# Begin training by disabling other pipeline components
with nlp.disable_pipes(*other_pipes) :

    sizes = compounding(1.0, 4.0, 1.001)
  # Training for 30 iterations     
    for itn in range(30):
    # shuffle examples before training
        random.shuffle(TRAIN_DATA) ## make sure the TRAIN_DATA name is the same
        # batch up the examples using spaCy's minibatch
        batches = minibatch(TRAIN_DATA, size=sizes)
        # ictionary to store losses
        losses = {}
        for batch in spacy.util.minibatch(TRAIN_DATA, size=2):
            for text, annotations in batch:
                doc = nlp.make_doc(text)
                example = Example.from_dict(doc, annotations)
                nlp.update([example], losses=losses, drop=0.3)
                print("Losses", losses)
              # Calling update() over the iteration
            


Losses {'ner': 7.806313018005381}
Losses {'ner': 11.702948203214476}
Losses {'ner': 17.14323264678997}
Losses {'ner': 21.42310412956909}
Losses {'ner': 24.671940056173543}
Losses {'ner': 28.60090874891962}
Losses {'ner': 31.34576019979143}
Losses {'ner': 35.12655208922014}
Losses {'ner': 37.4028881840643}
Losses {'ner': 39.62845951735621}
Losses {'ner': 44.755189677696805}
Losses {'ner': 47.7885478493951}
Losses {'ner': 49.71108761027069}
Losses {'ner': 51.558958577686305}
Losses {'ner': 54.1458883519776}
Losses {'ner': 56.8340124267278}
Losses {'ner': 58.23863942753152}
Losses {'ner': 60.569839692563995}
Losses {'ner': 62.57246781566519}
Losses {'ner': 67.61650750134349}
Losses {'ner': 68.92172324547974}
Losses {'ner': 71.94119818367187}
Losses {'ner': 73.93921806228168}
Losses {'ner': 77.48847308340582}
Losses {'ner': 79.33861957762765}
Losses {'ner': 80.64276938407065}
Losses {'ner': 82.67971284618191}
Losses {'ner': 84.25551388237908}
Losses {'ner': 86.20502936319146}
Losses {'ner'

Important Note: If after running the cell above you get a red line in any part of the output, this doesn't mean that the code is wrong or the Named Entity Recognition (NER) will fail. It simply means that in the train data that we gave above, there is a mistake in the number of position of the project name in the sentence. This can be resolved by reading the red line which will tell us which sentence is labeled wrong in the train data and correcting it.

Now the NER is trained with our Project label (in addition to its builtin train data). We will define function which will give us the prediction.

In [11]:
def project(text):
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]

In [12]:
# Example
project(str(roadsonline['content'][9]))

[('Gippsland Line Upgrade', 'PROJECT'),
 ('Shepparton Line Upgrade', 'PROJECT'),
 ('Bendigo Line Upgrade', 'PROJECT'),
 ('Gippsland-Shepparton-Bendigo', 'PROJECT'),
 ('Gippsland Line Upgrade  Building', 'PROJECT'),
 ('Avon River', 'PROJECT'),
 ('Morwell crossing loop', 'PROJECT'),
 ('Shepparton Line Upgrade  Platform', 'PROJECT'),
 ('Bendigo and Echuca Line Upgrade  Track upgrades', 'PROJECT')]

In [13]:
# Applying this function to the whole dataset
i=0
list_proj=[]
while i<len(roadsonline):
    list_proj.append(project(str(roadsonline['content'][i])))
    i+=1

#### Adding More Labels

In [14]:
LABEL1 = "COST"

TRAIN_DATA1=[

    ('The cost of building Melbourne’s new West Gate Tunnel has soared to at least $10 billion as a stoush between Transurban, two builders and the Victorian government over who should stump up billions of dollars to finish the troubled project heats up.', {"entities": [(77, 88, "COST")]}),
    ('The Palaszczuk Government has scrapped plans for a $250 million underground busway at Roma St as part of its Cross River Rail project.',{"entities": [(51, 63, "COST")]}),
    ('The $630 million Albion Park Rail Bypass has reached a new milestone, with the northbound lanes between Yallah and Oak Flats opened to traffic from Saturday.',{"entities": [(4, 14, "COST")]}),
    ('Feedback submissions from Queenslanders are encouraged on the $1 billion Winchester South coal mine located 30 km southeast of Moranbah in the Bowen Basin.',{"entities": [(62, 72, "COST")]}),
    ('Metso Outotec stated orders like these are usually worth around €15–20 million ($23–31 million), but a specific figure for this contract wasn’t disclosed.',{"entities": [(80, 94, "COST")]}),
    ('The $15 million expansion project will increase the facility’s capacity by about 50 per cent, with Jemena set to transport gas from the plant on behalf of Senex Energy to its customer GLNG.',{"entities": [(4, 15, "COST")]}),
    ('Interested parties have been given notice that tendering will soon commence on a $175 million road upgrade project in the ACT.',{"entities": [(81, 93, "COST")]}),
    ('The project is part of the $4 billion Regional Rail Revival program, which is upgrading every regional passenger rail line in Victoria and creating 3000 jobs and supplier opportunities.',{"entities": [(27, 37, "COST")]}),
    ('The $550 million project is anticipated to support up to 350 construction jobs and about 10 ongoing operational jobs.',{"entities": [(4, 16, "COST")]}),
    ('The Mid-West/ Wheatbelt Joint Development Assessment Panel approved the $250 million Waroona project on November 15 under conditions',{"entities": [(72, 84, "COST")]}),
    ('“The total project cost is $73 million,” Mr Jordan said.',{"entities": [(27, 38, "COST")]}),
    ('A key study has put a pricetag of up to $675 million on a south Australian iron ore project.',{"entities": [(40, 52, "COST")]}),
    ('The Federal Government is investing over $10 billion in a safer, smoother and more reliable Bruce Highway.',{"entities": [(42, 53, "COST")]}),
    ('Major construction is scheduled to commence soon on the $176.2 million Bendigo and Echuca Line Upgrade in Victoria following the appointment of John Holland to deliver the track upgrades.',{"entities": [(56, 70, "COST")]}),
    ('Western Australia’s $230 million Swan River Crossings project in Fremantle is a step closer to construction following the unveiling of the new alignment that was the result of an extensive community consultation.',{"entities": [(20, 32, "COST")]}),
    ('The Australian Rail Track Corporation (ARTC) has marked a new milestone on the $400 million Port Botany Rail Line Duplication and Cabramatta Loop Project with the awarding of two major contracts.',{"entities": [(79, 91, "COST")]}),
    ('The $82 million metallurgical coal mine extension project is expected to support approximately 550 jobs in the central Queensland coalfields.',{"entities": [(4, 15, "COST")]}),
    ('The approximately $3.5 billion Iron Bridge Magnetite Project is located 145 kilometres south of Port Hedland in the Pilbara region.',{"entities": [(18, 30, "COST")]}),
    ('The $14 million project is jointly funded by the Australian and Western Australian Governments along with third-party stakeholders',{"entities": [(4, 15, "COST")]}),
    ('Major construction has commenced on the $140 million Transit Oriented Development (TOD) at the Ferny Grove Train Station in Queensland, with works to be delivered by Queensland-based developer Honeycombes Property Group.',{"entities": [(40, 52, "COST")]}),
    ('Construction on an $11 million project that will replace the existing Rocky Creek Bridge that was built in 1928 is set to commence in August of this year.',{"entities": [(19, 30, "COST")]}),
    ]

The same steps should be repeated as we did for the first label. Note that when we add this new label and run the code below, the previous model is automatically deleted and a new model is created and trained with only the train data in the chunk above. Hence, we won't be able to get the project label anymore, but we will be only able to get the cost label now.

In [15]:
ner.add_label(LABEL1)

# Resume training
optimizer = nlp.resume_training()
move_names = list(ner.move_names)

# List of pipes you want to train
pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]

# List of pipes which should remain unaffected in training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]

In [16]:
from spacy.util import minibatch, compounding
import random

# Begin training by disabling other pipeline components
with nlp.disable_pipes(*other_pipes) :

    sizes = compounding(1.0, 4.0, 1.001)
  # Training for 30 iterations     
    for itn in range(30):
    # shuffle examples before training
        random.shuffle(TRAIN_DATA1)
        # batch up the examples using spaCy's minibatch
        batches = minibatch(TRAIN_DATA1, size=sizes)
        # ictionary to store losses
        losses = {}
        for batch in spacy.util.minibatch(TRAIN_DATA1, size=2):
            for text, annotations in batch:
                doc = nlp.make_doc(text)
                example = Example.from_dict(doc, annotations)
                nlp.update([example], losses=losses, drop=0.3)
                print("Losses", losses)
              # Calling update() over the iteration
            

Losses {'ner': 2.000022614324966}
Losses {'ner': 6.000017730292843}
Losses {'ner': 10.000009743273898}
Losses {'ner': 12.000009743275191}
Losses {'ner': 17.71809114246349}
Losses {'ner': 19.718088559898924}
Losses {'ner': 23.717882911023754}
Losses {'ner': 27.696806000012256}
Losses {'ner': 29.696108428557046}
Losses {'ner': 33.645952964944655}
Losses {'ner': 35.814086619684694}
Losses {'ner': 37.78838549857259}
Losses {'ner': 39.701407779845134}
Losses {'ner': 41.13664776726528}
Losses {'ner': 42.16541261626015}
Losses {'ner': 44.16823525674955}
Losses {'ner': 46.062502864417915}
Losses {'ner': 47.38313088440445}




Losses {'ner': 48.43600736361879}
Losses {'ner': 48.78832823085344}
Losses {'ner': 50.37332263725196}
Losses {'ner': 1.0497186116163597}
Losses {'ner': 2.0065499116173173}
Losses {'ner': 6.93397417621503}
Losses {'ner': 7.801058049669404}
Losses {'ner': 8.681155559050781}
Losses {'ner': 9.897066843262264}
Losses {'ner': 33.97370998797412}
Losses {'ner': 77.72507781673667}
Losses {'ner': 86.71805668093239}
Losses {'ner': 92.00319964739882}
Losses {'ner': 97.76833220710655}
Losses {'ner': 103.25931124940935}
Losses {'ner': 104.1025709685952}
Losses {'ner': 108.1488093457728}
Losses {'ner': 111.3938239542259}
Losses {'ner': 114.27636825633506}
Losses {'ner': 116.98176315500855}
Losses {'ner': 122.61096954912462}
Losses {'ner': 129.34075409751057}
Losses {'ner': 137.48004734336865}
Losses {'ner': 138.58811381391266}
Losses {'ner': 3.8694474694269934}
Losses {'ner': 5.797504601072016}
Losses {'ner': 10.845260066186917}
Losses {'ner': 11.927856197792153}
Losses {'ner': 12.789313375125372}
Lo

In [17]:
def cost(text):
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]

In [18]:
# If you don't like the format of how the above function outputs, you can try the version below by uncommenting it


# def cost(text):
#     doc = nlp(text)
#     return str([ent.text for ent in doc.ents])

In [19]:
i=0
list_cost=[]
while i<4591:
    list_cost.append(cost(str(roadsonline['content'][i])))
    i+=1
    
# 4591 is the length of dataset, so when applying to other datasets make sure to change it

#### Adding Start Label

In [20]:
LABEL2 = "START"

TRAIN_DATA2=[

    ('WA Premier, Mark McGowan, said that early works on the project are expected to commence later in 2021, with major construction kicking off in early 2022. Mr McGowan added that the project will support more than 1,400 local jobs.', {"entities": [(142, 152, "START")]}),
    ('The NSW Government will start tendering for contracts and hopes to begin construction in the first quarter of 2021.',{"entities": [(110, 114, "START")]}),
    ('Project updates for the so-called Western Harbour Tunnel between Rozelle and North Sydney, and Beaches Link motorway between Cammeray and Seaforth and Balgowlah, say construction on both roads will start in 2020-21, with the roads opening in 2025-26.',{"entities": [(207, 214, "START")]}),
    ('The first stage of construction on the new airport began on 24 September 2018, and the first stage is expected to be complete and open by December 2026.',{"entities": [(60, 77, "START")]}),
    ('Major construction on the four-kilometre Warringah Freeway upgrades will begin in late 2021, while some work will begin as early as March.',{"entities": [(82, 91, "START")]}),
    ('Major construction is expected to begin in October',{"entities": [(43, 50, "START")]}),
    ('The contract will start and end in 2022, with plant production expected late in the final quarter of the year.',{"entities": [(35, 39, "START")]}),
    ('Construction for the transmission line is expected to commence in April, 2022, for completion by early 2024.',{"entities": [(66, 77, "START")]}),
    ('Early works have begun on an $80 million rail infrastructure project south of Perth for which the head contractor is locked in and main works will begin in October.',{"entities": [(156, 163, "START")]}),
    ('Following site mobilisation in the second half of 2021, major construction works on both the Botany Rail Duplication and Cabramatta Loop are scheduled to commence in the first quarter of 2022.',{"entities": [(170, 191, "START")]}),
    ('The Federal Department of Infrastructure, Transport, Regional Development and Communications said construction was expected to commence in late 2022 and be completed by late 2025.',{"entities": [(139, 148, "START")]}),
    ('With the combined funds received from the placement and SPP, the company now has the financial capacity to meet its share of final feasibility study costs and commitment to early lead time items under the joint venture with Minotaur and allow construction and mining to commence at the project in 2022 once a final decision to proceed is made by the joint venture partners.',{"entities": [(297, 301, "START")]}),
    ('Early works on the project is anticipated to commence in 2022 with construction expected to take 18 months to complete. ',{"entities": [(57, 61, "START")]}),
    ('The proponents Bowen Pipeline Company(BPC)are pushing for co-ordinated project status and aiming for a mid-2023 start to construction for the Bowen Pipeline Project.',{"entities": [(103, 111, "START")]}),
    ('Expected commencement of construction is December, 2021.',{"entities": [(41, 55, "START")]}),
    ('NRW said design and procurement would commence immediately and be followed by commencement of construction in August 2021.',{"entities": [(110, 121, "START")]}),
    ('Subject to approvals construction is expected to start in 2021 with full operation in 2022.',{"entities": [(58, 62, "START")]}),
    ('Construction for early works is scheduled to commence next year, subject to all relevant approvals.',{"entities": [(54, 63, "START")]}),
    ('“It is anticipated that construction of stage 1 would begin in 2024 with the first operations to commence in 2025,” Spark Infrastructure said.',{"entities": [(63, 67, "START")]}),
    ('Early construction activities are set to commence in late 2021 and major construction in 2022. ',{"entities": [(53, 62, "START")]}),
    ('“Core is excited to continue accomplishing crucial milestones for the project as we prepare to commence construction before the end of 2021, and deliver first production in 2022,” added Biggins. ',{"entities": [(128, 139, "START")]}),
    ('Weather and construction conditions permitting, early works will begin in early 2022 and completed by 2023 while the entire Stage 1 of the B2N project is expected to be completed by 2025. ',{"entities": [(74, 84, "START")]}),    
    ('“With construction expected to begin in the first half of 2022, we are keen to start the conversation with local businesses on ways they could support the project. ',{"entities": [(44, 62, "START")]}),    
    ('Subject to relevant State and Federal Government approvals, initial works on this project is anticipated to commence in 2023. ',{"entities": [(120, 124, "START")]}),
    ('“We anticipate early works to commence later this year, with major construction kicking off in early 2022 ',{"entities": [(95, 105, "START")]}),
    ('Both companies are expected to mobilise on site in the second half of 2021 and major construction on the Port Botany Rail Line Duplication and Cabramatta Loop projects are anticipated to commence in the first quarter of 2022.  ',{"entities": [(203, 224, "START")]}),
    ('Early works on this development commenced in December 2020. Major construction is due to begin in the coming months and the tunnelling contract to be awarded by the end of this year, with the massive tunnel boring machines to begin digging by the end of 2023.',{"entities": [(95, 115, "START")]}),
    ('Construction on an $11 million project that will replace the existing Rocky Creek Bridge that was built in 1928 is set to commence in August of this year. ',{"entities": [(134, 140, "START")]}),

]

In [21]:
ner.add_label(LABEL2)

# Resume training
optimizer = nlp.resume_training()
move_names = list(ner.move_names)

# List of pipes you want to train
pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]

# List of pipes which should remain unaffected in training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]

In [22]:
from spacy.util import minibatch, compounding
import random

# Begin training by disabling other pipeline components
with nlp.disable_pipes(*other_pipes) :

    sizes = compounding(1.0, 4.0, 1.001)
  # Training for 30 iterations     
    for itn in range(30):
    # shuffle examples before training
        random.shuffle(TRAIN_DATA2)
        # batch up the examples using spaCy's minibatch
        batches = minibatch(TRAIN_DATA2, size=sizes)
        # ictionary to store losses
        losses = {}
        for batch in spacy.util.minibatch(TRAIN_DATA2, size=2):
            for text, annotations in batch:
                doc = nlp.make_doc(text)
                example = Example.from_dict(doc, annotations)
                nlp.update([example], losses=losses, drop=0.3)
                print("Losses", losses)
              # Calling update() over the iteration
            

Losses {'ner': 1.999997735023501}
Losses {'ner': 3.9999752044684156}
Losses {'ner': 5.793811082840781}
Losses {'ner': 9.793782353532238}
Losses {'ner': 11.746921782807473}
Losses {'ner': 13.947468297464528}
Losses {'ner': 15.94737437231785}
Losses {'ner': 19.93263166597194}
Losses {'ner': 21.912057067659614}
Losses {'ner': 23.895345184693184}
Losses {'ner': 25.89594271200422}
Losses {'ner': 27.904832615170516}
Losses {'ner': 29.33392958086583}
Losses {'ner': 31.219858669529707}
Losses {'ner': 33.660584957034146}
Losses {'ner': 35.235980388448404}
Losses {'ner': 36.865662622255435}
Losses {'ner': 38.336000509062195}
Losses {'ner': 40.02514909208991}
Losses {'ner': 41.73327761220867}
Losses {'ner': 44.364393209672045}
Losses {'ner': 45.49727224599515}
Losses {'ner': 48.985173839878925}
Losses {'ner': 50.995752392381284}
Losses {'ner': 52.668280777786194}
Losses {'ner': 55.66620946500407}
Losses {'ner': 57.656604785173315}
Losses {'ner': 59.632906634542024}
Losses {'ner': 1.68064811119127

In [23]:
def start(text):
    doc = nlp(text)
    return str([ent.text for ent in doc.ents])

In [24]:
i=0
list_start=[]
while i<4591:
    list_start.append(cost(str(roadsonline['content'][i])))
    i+=1
    
# 4591 is the length of dataset, so when applying to other datasets make sure to change it

#### Adding End Label

In [25]:
LABEL3 = "END"

TRAIN_DATA3=[

    ('Transurban has not given a new completion date for the West Gate Tunnel, which was originally scheduled to be open by the end of 2022.',{"entities": [(129, 133, "END")]}),
    ('The state government predicts 36,000 people will transfer between trains and buses at Roma Street station daily by the time the project is completed in 2024.',{"entities": [(152, 156, "END")]}),
    ('Construction on the twin tunnels will begin early next year, and be completed by 2025, a year later than the government first intended.',{"entities": [(81, 85, "END")]}),
    ('Project updates for the so-called Western Harbour Tunnel between Rozelle and North Sydney, and Beaches Link motorway between Cammeray and Seaforth and Balgowlah, say construction on both roads will start in 2020-21, with the roads opening in 2025-26.',{"entities": [(242, 249, "END")]}),
    ('The first stage of construction on the new airport began on 24 September 2018, and the first stage is expected to be complete and open by December 2026.',{"entities": [(138, 151, "END")]}),
    ('The contract will start and end in 2022, with plant production expected late in the final quarter of the year.',{"entities": [(35, 39, "END")]}),
    ('Premier Mark McGowan says the battery would likely be the second biggest in Australia once complete in 2022z, although it would largely depend on the status of other big battery projects elsewhere, including the mooted replacement for the Liddell coal generator in NSW.',{"entities": [(103, 107, "END")]}),
    ('The project completion is expected in early 2022',{"entities": [(38, 48, "END")]}),
    ('Construction for the transmission line is expected to commence in April, 2022, for completion by early 2024.',{"entities": [(97, 107, "END")]}),
    ('The Federal Department of Infrastructure, Transport, Regional Development and Communications said construction was expected to commence in late 2022 and be completed by late 2025.',{"entities": [(169, 178, "END")]}),
    ('EDL said the microgrid’s diesel generators were scheduled to come online by the end of 2021 and the solar farm in early 2022, with the entire hybrid power station expected to be fully commissioned by February 2022.',{"entities": [(200, 213, "END")]}),
    ('Stage Two of the Shepparton Line Upgrade is targeted for completion in late 2022.',{"entities": [(71, 80, "END")]}),
    ('Construction and commissioning is scheduled for completion in early 2022.',{"entities": [(62, 72, "END")]}),
    ('The Forrestfield-Airport Link is scheduled to open in the second half of 2021, and provide a 20 minute direct link between the eastern foothills, central business district, wider public transport network and airport.',{"entities": [(58, 77, "END")]}),
    ('Subject to approvals construction is expected to start in 2021 with full operation in 2022.',{"entities": [(86, 90, "END")]}),
    ('Magnetite CEO Peter Schubert said with a small-scale start-up, the company is planning to ship its first iron ore product from the Razorback project at the end of 2024. ',{"entities": [(156, 167, "END")]}),
    ('Project director Frank Fisseler was unavailable for comment, but Projectory understands that the WEP Project team are planning for the majority of the work to be completed by 2023.',{"entities": [(175, 179, "END")]}),
    ('“It is anticipated that construction of stage 1 would begin in 2024 with the first operations to commence in 2025,” Spark Infrastructure said.',{"entities": [(109, 113, "END")]}),
    ('The new Pimpama station is due to be completed in time for the opening of CRR in 2025. ',{"entities": [(81, 85, "END")]}),
    ('The project is expected to create 100 jobs during a projected two-year construction phase and commissioning period, with completion expected in September, 2022.',{"entities": [(144, 159, "END")]}),
    ('Weather and construction conditions permitting, early works will begin in early 2022 and completed by 2023 while the entire Stage 1 of the B2N project is expected to be completed by 2025. ',{"entities": [(182, 186, "END")]}),
    ('Weather permitting, the Genex Kidston Connection Project is due to be completed in early 2024. ',{"entities": [(83, 93, "END")]}),
    ('The new Goornong Station is projected to be completed by the end of 2021',{"entities": [(61, 72, "END")]}),
    ('while the new Huntly Station to be completed in mid-2022.',{"entities": [(48, 56, "END")]}),
    ('The Raywood Station on the Swan Hill Line is due to be completed at the end of 2022. ',{"entities": [(72, 83, "END")]}),
    ('The new ETO will allow for additional weekday return services to both Epsom and Eaglehawk as well as triple the number of weekday train services to and from Echuca once the project is completed in late 2022. ',{"entities": [(197, 206, "END")]}),
    ('Weather and construction conditions permitting, this Bruce Highway upgrade is expected to be completed by late 2023. ',{"entities": [(106, 115, "END")]}),
    ('SIMPEC will commence works immediately, with project completion expected by mid-2022. ',{"entities": [(76, 84, "END")]}),
    ('Works are undertaken by WBHO Infrastructure Pty Ltd and construction is due to be completed in the first half of 2022. ',{"entities": [(99, 117, "END")]}),
    ('Weather permitting, works on the car park upgrade is due to be completed by late 2022, with the entire Ferny Grove project completed by late 2023. ',{"entities": [(136, 145, "END")]}),
]

In [26]:
ner.add_label(LABEL3)

# Resume training
optimizer = nlp.resume_training()
move_names = list(ner.move_names)

# List of pipes you want to train
pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]

# List of pipes which should remain unaffected in training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]

In [27]:
from spacy.util import minibatch, compounding
import random

# Begin training by disabling other pipeline components
with nlp.disable_pipes(*other_pipes) :

    sizes = compounding(1.0, 4.0, 1.001)
  # Training for 30 iterations     
    for itn in range(30):
    # shuffle examples before training
        random.shuffle(TRAIN_DATA3)
        # batch up the examples using spaCy's minibatch
        batches = minibatch(TRAIN_DATA3, size=sizes)
        # ictionary to store losses
        losses = {}
        for batch in spacy.util.minibatch(TRAIN_DATA3, size=2):
            for text, annotations in batch:
                doc = nlp.make_doc(text)
                example = Example.from_dict(doc, annotations)
                nlp.update([example], losses=losses, drop=0.3)
                print("Losses", losses)
              # Calling update() over the iteration
            

Losses {'ner': 3.999987363815308}
Losses {'ner': 5.985948732316831}
Losses {'ner': 7.985948733389672}
Losses {'ner': 9.558221034109888}
Losses {'ner': 11.558214523162645}
Losses {'ner': 13.558134526742151}
Losses {'ner': 17.05275640637441}
Losses {'ner': 20.860085358190496}
Losses {'ner': 22.85856425297134}
Losses {'ner': 26.37959479823447}
Losses {'ner': 27.515274645861567}
Losses {'ner': 28.868214967995126}
Losses {'ner': 30.85920746226021}
Losses {'ner': 34.01471767870734}
Losses {'ner': 35.24225800981273}
Losses {'ner': 39.220968606935294}
Losses {'ner': 41.05054116645914}
Losses {'ner': 44.3907630831712}
Losses {'ner': 45.44366453954049}
Losses {'ner': 49.44319431912777}
Losses {'ner': 51.44156594462743}
Losses {'ner': 51.44664221616336}




Losses {'ner': 54.97912967488346}
Losses {'ner': 56.58450986722599}
Losses {'ner': 58.15800255863724}
Losses {'ner': 60.22594145913053}
Losses {'ner': 62.202033967637846}
Losses {'ner': 63.92342284922898}
Losses {'ner': 65.74372746433578}
Losses {'ner': 69.22142872397853}
Losses {'ner': 1.5961569302809293}
Losses {'ner': 3.30108587735717}
Losses {'ner': 4.57532249691125}
Losses {'ner': 5.833076863908518}
Losses {'ner': 6.960600793360738}
Losses {'ner': 39.97668381342109}
Losses {'ner': 39.97990672289053}
Losses {'ner': 41.92547829886486}
Losses {'ner': 50.72802330137857}
Losses {'ner': 54.62697388443704}
Losses {'ner': 55.942574407575556}
Losses {'ner': 57.59352057780477}
Losses {'ner': 60.81790984478481}
Losses {'ner': 62.59831456365964}
Losses {'ner': 64.01762424407082}
Losses {'ner': 66.55852036405678}
Losses {'ner': 69.65731593765574}
Losses {'ner': 71.36210056383935}
Losses {'ner': 74.82658327862337}
Losses {'ner': 75.79270833227564}
Losses {'ner': 77.94379598520717}
Losses {'ner'

In [28]:
def end(text):
    doc = nlp(text)
    return str([ent.text for ent in doc.ents])

In [29]:
i=0
list_end=[]
while i<len(roadsonline):
    list_end.append(cost(str(roadsonline['content'][i])))
    i+=1

#### Adding Contractor Label

In [30]:
LABEL4 = "CONTRACTOR"

TRAIN_DATA4=[

    ('Contractor Fulton Hogan was chosen to build the multi-million dollar bypass after a competitive tender process that shortlisted three companies, which includes CPB Contractors and a joint venture between BMD Constructions and John Holland.',{"entities": [(11, 23, "CONTRACTOR")]}),
    ('Northern Star Resources has appointed GR Engineering Services as its engineering, procurement and construction contractor at the Thunderbox gold operations in Western Australia.',{"entities": [(38, 61, "CONTRACTOR")]}),
    ('The Australian Rail Track Corporation (ARTC) has awarded John Holland the design and construction contract for the Botany Rail Duplication Project.',{"entities": [(57, 69, "CONTRACTOR")]}),
    ('Energy Developments Limited has secured the contract to build and operate the Jabiru hybrid power station.',{"entities": [(0, 27, "CONTRACTOR")]}),
    ('Juwi Renewable Energy in turn has been appointed to construct the 3.9 MW solar farm. ',{"entities": [(0, 21, "CONTRACTOR")]}),
    ('Coleman Rail in partnership with KBR has been selected to deliver Stage Two of the Shepparton Line Upgrade.',{"entities": [(0, 12, "CONTRACTOR")]}),
    ('Rio Tinto has awarded NRW a contract for the delivery of a 34MW Solar PV System at the Gudai Darri mine.',{"entities": [(22, 25, "CONTRACTOR")]}),
    ('The Pacific Industrial Company recently secured the steel fabrication package for the Public Transport Authority (PTA) of Western Australia’s METRONET Airport Central Station Project at Perth Airport.',{"entities": [(4, 30, "CONTRACTOR")]}),
    ('Barminco signed a Letter of Intent (LOI) with Panoramic Resources Limited for a $280 million four-year contract, now finalised, to begin works at the Savannah Nickel Project in WA.',{"entities": [(0, 8, "CONTRACTOR")]}),
    ('Core Lithium has signed a two-year deal with the NT Power and Water Corporation (PWC) to connect its Finniss project to the power grid.',{"entities": [(0, 12, "CONTRACTOR")]}),
    ('Works are undertaken by WBHO Infrastructure Pty Ltd and construction is due to be completed in the first half of 2022. ',{"entities": [(24, 51, "CONTRACTOR")]}),
    ('Major construction has commenced on the $140 million Transit Oriented Development (TOD) at the Ferny Grove Train Station in Queensland, with works to be delivered by Queensland-based developer Honeycombes Property Group. ',{"entities": [(193, 219, "CONTRACTOR")]}),
    ('Construction of Ferny Grove Central is being undertaken by Broad Constructions – a wholly owned subsidiary of CIMIC Group company CPB Contractors',{"entities": [(59, 78, "CONTRACTOR")]}),
    
]

In [31]:
ner.add_label(LABEL4)

# Resume training
optimizer = nlp.resume_training()
move_names = list(ner.move_names)

# List of pipes you want to train
pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]

# List of pipes which should remain unaffected in training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]

In [32]:
from spacy.util import minibatch, compounding
import random

# Begin training by disabling other pipeline components
with nlp.disable_pipes(*other_pipes) :

    sizes = compounding(1.0, 4.0, 1.001)
  # Training for 30 iterations     
    for itn in range(30):
    # shuffle examples before training
        random.shuffle(TRAIN_DATA4)
        # batch up the examples using spaCy's minibatch
        batches = minibatch(TRAIN_DATA4, size=sizes)
        # ictionary to store losses
        losses = {}
        for batch in spacy.util.minibatch(TRAIN_DATA4, size=2):
            for text, annotations in batch:
                doc = nlp.make_doc(text)
                example = Example.from_dict(doc, annotations)
                nlp.update([example], losses=losses, drop=0.3)
                print("Losses", losses)
              # Calling update() over the iteration
       

Losses {'ner': 2.0000914828417105}
Losses {'ner': 4.000018407547328}
Losses {'ner': 6.000018408999115}
Losses {'ner': 8.000018549031601}
Losses {'ner': 10.000018549167955}
Losses {'ner': 12.000018075921735}
Losses {'ner': 16.000016526378868}
Losses {'ner': 18.00001645837731}
Losses {'ner': 21.822148669010623}
Losses {'ner': 25.66798698316658}
Losses {'ner': 27.6536527727026}
Losses {'ner': 29.638329547862845}
Losses {'ner': 31.35427895301364}
Losses {'ner': 1.7987351752527934}
Losses {'ner': 3.4272930671811275}
Losses {'ner': 5.4998123010124775}
Losses {'ner': 7.739968634037936}
Losses {'ner': 9.211956728695512}
Losses {'ner': 15.343439174279492}
Losses {'ner': 19.07216234769976}
Losses {'ner': 20.977775343588707}
Losses {'ner': 25.329544511299574}
Losses {'ner': 28.179660623113314}
Losses {'ner': 29.98705254711279}
Losses {'ner': 33.18920696539403}
Losses {'ner': 35.28176251843164}
Losses {'ner': 1.5780875643836225}
Losses {'ner': 2.915824839665996}
Losses {'ner': 3.6764086772155498}


In [33]:
def contractor(text):
    doc = nlp(text)
    return str([ent.text for ent in doc.ents])

In [34]:
i=0
list_contractor=[]
while i<4591:
    list_contractor.append(cost(str(roadsonline['content'][i])))
    i+=1

We have created 5 different lists of tags for each entry in our dataset. Now we just need to make them into columns, which is done below.

In [35]:
roadsonline['project']=list_proj
roadsonline['cost']=list_cost
roadsonline['start']=list_start
roadsonline['end']=list_end
roadsonline['contractor']=list_contractor

In [36]:
# choosing the columns we need
roadsonline[['wrap','date article', 'author', 'category', 'content', 'project',
       'cost', 'start', 'end', 'contractor']]

Unnamed: 0,wrap,date article,author,category,content,project,cost,start,end,contractor
0,Portable seal age assessment device set to spe...,"August 20, 2020",Lauren Jones,"Posted in Industry News, Latest News ...",One of ARRB’s FTIR devices. Image courtesy of ...,"[(Fourier Transform Infrared Spectroscopy, PRO...",[],[],[],"[(Fourier Transform Infrared, CONTRACTOR)]"
1,Route for proposed Mackay Ring Road finalised,"August 18, 2016",pcm_admin,"Posted in Latest News, Project Report",Initial works on the proposed Mackay Ring Road...,"[(Mackay Ring Road between Stockroute Road, PR...","[($448 million, COST), ($117 million, COST)]","[(mid-2017, START)]",[],[]
2,ARRB wins AAPA innovation award,"June 27, 2019",Holly Keys,Posted in Latest News Tagged...,Department of Transport and Main Roads’ Peter ...,"[(Main Roads Department, PROJECT), (Main Roads...",[],"[(2019, START)]","[(2019, END)]","[(Matthew Bereni, CONTRACTOR)]"
3,Heavy vehicle decoupling facility opens in Gat...,"June 1, 2021",Tara Hamid,"Posted in Industry News, Latest News ...",Picture courtesy of Queensland Government The ...,"[(Gatton Heavy Vehicle Decoupling Facility, PR...",[],[],[],"[(Gatton Heavy Vehicle Decoupling Facility, CO..."
4,ResourceCo: Ready to support government’s gree...,"June 14, 2021",Shanna Wong,"Posted in Environment, Industry News, Latest N...",ResourceCo’s Resource Recovery Facility in Wet...,"[(Resource Recovery Facility, PROJECT), (Eskin...","[($1.5 billion, COST)]","[(April 2021, START), (June, START)]","[(Fairweather, END)]","[(ResourceCo’s, CONTRACTOR), (Jim Fairweather,..."
...,...,...,...,...,...,...,...,...,...,...
4586,Delivery partner model used on W2B Pacific Hwy...,"October 13, 2015",pcm_admin,Posted in Project Report,"Dating back to 1896, the modern Olympic Games ...","[(Olympic Games, PROJECT), (Games’, PROJECT), ...",[],"[(2011, START), (1996, START), (July last year...","[(2011, END), (2020, END)]","[(London 2012, CONTRACTOR), (Bob Higgins, CONT..."
4587,EOI open for freeway upgrades on $2.3B METRONE...,"August 15, 2017",pcm_admin,"Posted in Asphalt Review, Industry News, Lates...",The Western Australia Government is seeking ex...,"[(Kwinana Freeway, PROJECT), (Russell Road, PR...","[($2.3 billion, COST)]","[($2.3 billion, START), (late 2018, START), (m...","[(late 2019, END), (early 2020, END), (mid-201...","[(Rita Saffioti, CONTRACTOR), (mid-2019., CONT..."
4588,City resilience holding Brisbane back – new re...,"September 21, 2017",pcm_admin,"Posted in Civil Works, Industry News, Latest N...","According to a new report released by Arcadis,...","[(Brisbane’s, PROJECT), (City Executive South ...",[],[],[],"[(Arcadis, Brisbane, CONTRACTOR), (Louisa Cart..."
4589,Two-hundred million bottles with Alex Fraser,"June 17, 2020",Lauren Jones,"Posted in Industry News, Latest News ...",A new glass additive bin at Alex Fraser’s Clar...,"[(Clarinda recycling facility, PROJECT), (Reso...","[($4.67 million, COST), ($336,500 grant, COST)]","[(early 2021, START)]","[(production annually, END), (Melbourne, END),...","[(Alex Fraser, CONTRACTOR), (Matt Genever, CON..."


In [37]:
# exporting it to csv file
roadsonline[['wrap','date article', 'author', 'category', 'content', 'project',
       'cost', 'start', 'end', 'contractor']].to_csv('ROADSONLINE.csv')

## Specification 2

In this specification, to improve the efficiency of the model, we will accummulate all training data, with all new labels and will train the model altogether cumulatively with the available training data. This will enable the algorithm to better distinguish between labels, increase the efficiency of NER, understand the location of specific words in the sentence much better. As a side note, you can add more training data, just make sure to be consistent in the format the training data is constructed.

In [38]:
import spacy
nlp=spacy.load("en_core_web_sm") 

# Getting the ner component
ner=nlp.get_pipe('ner')

In [39]:
from spacy.training.example import Example

In [40]:
LABEL = "PROJECT"
LABEL1 = "COST"
LABEL2 = "START"
LABEL3 = "END"
LABEL4 = "CONTRACTOR"


TRAIN_DATA=[

    ('The cost of building Melbourne’s new West Gate Tunnel has soared to at least $10 billion as a stoush between Transurban, two builders and the Victorian government over who should stump up billions of dollars to finish the troubled project heats up.', {"entities": [(37, 53, "PROJECT")]}),
    ('Major construction on the four-kilometre Warringah Freeway upgrades will begin in late 2021, while some work will begin as early as March.',{"entities": [(41, 67, "PROJECT")]}),
    ('The new Western Sydney Airport rail link to St Marys station has been given planning approval by the NSW government.',{"entities": [(8, 40, "PROJECT")]}),
    ('The Palaszczuk Government has scrapped plans for a $250 million underground busway at Roma St as part of its Cross River Rail project.',{"entities": [(109, 125, "PROJECT")]}),
    ('The $630 million Albion Park Rail Bypass has reached a new milestone, with the northbound lanes between Yallah and Oak Flats opened to traffic from Saturday.',{"entities": [(17, 40, "PROJECT")]}),
    ('Feedback submissions from Queenslanders are encouraged on the $1 billion Winchester South coal mine located 30 km southeast of Moranbah in the Bowen Basin.',{"entities": [(73, 99, "PROJECT")]}),
    ('The Australian Rail Track Corporation (ARTC) has awarded John Holland the design and construction contract for the Botany Rail Duplication Project.',{"entities": [(115, 146, "PROJECT")]}),
    ('Tenders ACT has issued a Notice of Upcoming Tender for the design and construction of the John Gorton Drive and Molonglo River Bridge Crossing project.',{"entities": [(90, 150, "PROJECT")]}),
    ('The executive director of the North-South Corridor project, Susana Fueyo, will brief local suppliers and subcontractors on the $9.9bn next stage of development.',{"entities": [(30, 58, "PROJECT")]}),
    ('Energy Developments Limited has secured the contract to build and operate the Jabiru hybrid power station.',{"entities": [(78, 105, "PROJECT")]}),
    ('Coleman Rail in partnership with KBR has been selected to deliver Stage Two of the Shepparton Line Upgrade.',{"entities": [(83, 106, "PROJECT")]}),
    ('According to Rail Projects Victoria, the Shepparton Line Upgrade is set to create 600 jobs over the three stages.',{"entities": [(41, 64, "PROJECT")]}),
    ('ACCIONA has secured approval from the Queensland Government for its Material Change of Use for the Aldoga Solar Farm development 20 km north west Gladstone, Queensland.',{"entities": [(99, 116, "PROJECT")]}),
    ('The proponents Bowen Pipeline Company(BPC)are pushing for co-ordinated project status and aiming for a mid-2023 start to construction for the Bowen Pipeline Project.',{"entities": [(142, 164, "PROJECT")]}),
    ('Publicly-owned generator Stanwell Corporation is partnering with the Japanese Iwatani Corporation to develop the three-gigawatt Aldoga renewable hydrogen facility (ARHP) in Aldoga, west of Gladstone.',{"entities": [(128, 162, "PROJECT")]}),
    ('Rio Tinto has awarded NRW a contract for the delivery of a 34MW Solar PV System at the Gudai Darri mine.',{"entities": [(87, 103, "PROJECT")]}),
    ('The Pacific Industrial Company recently secured the steel fabrication package for the Public Transport Authority (PTA) of Western Australia’s METRONET Airport Central Station Project at Perth Airport.',{"entities": [(142, 182, "PROJECT")]}),
    ('The Forrestfield-Airport Link is scheduled to open in the second half of 2021, and provide a 20 minute direct link between the eastern foothills, central business district, wider public transport network and airport.',{"entities": [(4, 29, "PROJECT")]}),
    ('Proponent SE Waroona Development Pty Ltd - part of Victorian-based company South Energy - is behind the proposed Waroona Solar Farm 11km south-west of Waroona and 30 km south of Pinjarra in Western Australia and spanning 300 hectares.',{"entities": [(113, 131, "PROJECT")]}),
    ('Transport for NSW (TfNSW) called for a team of design and construction Non-Owner Participants to form a Project Alliance to procure and deliver the design and construction of the Heathcote Bridge Widening project.',{"entities": [(179, 204, "PROJECT")]}),
    ('Magnetite Mines has released a pre-feasibility study (PFS) for its flagship Razorback project in South Australia.',{"entities": [(76, 93, "PROJECT")]}),
    ('The Federal Government is investing over $10 billion in a safer, smoother and more reliable Bruce Highway.',{"entities": [(92, 105, "PROJECT")]}),
    ('Weather permitting, the Genex Kidston Connection Project is due to be completed in early 2024.',{"entities": [(24, 56, "PROJECT")]}),
    ('Major construction is scheduled to commence soon on the $176.2 million Bendigo and Echuca Line Upgrade in Victoria following the appointment of John Holland to deliver the track upgrades.',{"entities": [(71, 102, "PROJECT")]}),
    ('The new Goornong Station is projected to be completed by the end of 2021',{"entities": [(8, 24, "PROJECT")]}),
    ('while the new Huntly Station to be completed in mid-2022.',{"entities": [(14, 28, "PROJECT")]}),
    ('The Raywood Station on the Swan Hill Line is due to be completed at the end of 2022. ',{"entities": [(4, 19, "PROJECT")]}),
    ('Western Australia’s $230 million Swan River Crossings project in Fremantle is a step closer to construction following the unveiling of the new alignment that was the result of an extensive community consultation.',{"entities": [(33, 61, "PROJECT")]}),
    ('The Australian Rail Track Corporation (ARTC) has marked a new milestone on the $400 million Port Botany Rail Line Duplication and Cabramatta Loop Project with the awarding of two major contracts.',{"entities": [(92, 153, "PROJECT")]}),
    ('Another METRONET project is moving forward as construction commences on the new Lakelands Station, which is located between the Warnbro and Mandurah stations in Western Australia.',{"entities": [(80, 97, "PROJECT")]}),
    ('The approximately $3.5 billion Iron Bridge Magnetite Project is located 145 kilometres south of Port Hedland in the Pilbara region.',{"entities": [(31, 60, "PROJECT")]}),
    ('Weather permitting, works on the car park upgrade is due to be completed by late 2022, with the entire Ferny Grove project completed by late 2023.',{"entities": [(103, 114, "PROJECT")]}),
    ('Feedback from the community is going to be considered as part of the detailed design for the $327.5 million Stage 1 upgrade on the Mooloolah River Interchange.',{"entities": [(131, 158, "PROJECT")]}),
    ('The State Government is investing $84 million to deliver upgrades to the East and West Tamar Highways',{"entities": [(73, 101, "PROJECT")]}),
    ('The cost of building Melbourne’s new West Gate Tunnel has soared to at least $10 billion as a stoush between Transurban, two builders and the Victorian government over who should stump up billions of dollars to finish the troubled project heats up.', {"entities": [(77, 88, "COST")]}),
    ('The Palaszczuk Government has scrapped plans for a $250 million underground busway at Roma St as part of its Cross River Rail project.',{"entities": [(51, 63, "COST")]}),
    ('The $630 million Albion Park Rail Bypass has reached a new milestone, with the northbound lanes between Yallah and Oak Flats opened to traffic from Saturday.',{"entities": [(4, 14, "COST")]}),
    ('Feedback submissions from Queenslanders are encouraged on the $1 billion Winchester South coal mine located 30 km southeast of Moranbah in the Bowen Basin.',{"entities": [(62, 72, "COST")]}),
    ('Metso Outotec stated orders like these are usually worth around €15–20 million ($23–31 million), but a specific figure for this contract wasn’t disclosed.',{"entities": [(80, 94, "COST")]}),
    ('The $15 million expansion project will increase the facility’s capacity by about 50 per cent, with Jemena set to transport gas from the plant on behalf of Senex Energy to its customer GLNG.',{"entities": [(4, 15, "COST")]}),
    ('Interested parties have been given notice that tendering will soon commence on a $175 million road upgrade project in the ACT.',{"entities": [(81, 93, "COST")]}),
    ('The project is part of the $4 billion Regional Rail Revival program, which is upgrading every regional passenger rail line in Victoria and creating 3000 jobs and supplier opportunities.',{"entities": [(27, 37, "COST")]}),
    ('The $550 million project is anticipated to support up to 350 construction jobs and about 10 ongoing operational jobs.',{"entities": [(4, 16, "COST")]}),
    ('The Mid-West/ Wheatbelt Joint Development Assessment Panel approved the $250 million Waroona project on November 15 under conditions',{"entities": [(72, 84, "COST")]}),
    ('“The total project cost is $73 million,” Mr Jordan said.',{"entities": [(27, 38, "COST")]}),
    ('A key study has put a pricetag of up to $675 million on a south Australian iron ore project.',{"entities": [(40, 52, "COST")]}),
    ('The Federal Government is investing over $10 billion in a safer, smoother and more reliable Bruce Highway.',{"entities": [(42, 53, "COST")]}),
    ('Major construction is scheduled to commence soon on the $176.2 million Bendigo and Echuca Line Upgrade in Victoria following the appointment of John Holland to deliver the track upgrades.',{"entities": [(56, 70, "COST")]}),
    ('Western Australia’s $230 million Swan River Crossings project in Fremantle is a step closer to construction following the unveiling of the new alignment that was the result of an extensive community consultation.',{"entities": [(20, 32, "COST")]}),
    ('The Australian Rail Track Corporation (ARTC) has marked a new milestone on the $400 million Port Botany Rail Line Duplication and Cabramatta Loop Project with the awarding of two major contracts.',{"entities": [(79, 91, "COST")]}),
    ('The $82 million metallurgical coal mine extension project is expected to support approximately 550 jobs in the central Queensland coalfields.',{"entities": [(4, 15, "COST")]}),
    ('The approximately $3.5 billion Iron Bridge Magnetite Project is located 145 kilometres south of Port Hedland in the Pilbara region.',{"entities": [(18, 30, "COST")]}),
    ('The $14 million project is jointly funded by the Australian and Western Australian Governments along with third-party stakeholders',{"entities": [(4, 15, "COST")]}),
    ('Major construction has commenced on the $140 million Transit Oriented Development (TOD) at the Ferny Grove Train Station in Queensland, with works to be delivered by Queensland-based developer Honeycombes Property Group.',{"entities": [(40, 52, "COST")]}),
    ('Construction on an $11 million project that will replace the existing Rocky Creek Bridge that was built in 1928 is set to commence in August of this year.',{"entities": [(19, 30, "COST")]}),
    ('WA Premier, Mark McGowan, said that early works on the project are expected to commence later in 2021, with major construction kicking off in early 2022. Mr McGowan added that the project will support more than 1,400 local jobs.', {"entities": [(142, 152, "START")]}),
    ('The NSW Government will start tendering for contracts and hopes to begin construction in the first quarter of 2021.',{"entities": [(110, 114, "START")]}),
    ('Project updates for the so-called Western Harbour Tunnel between Rozelle and North Sydney, and Beaches Link motorway between Cammeray and Seaforth and Balgowlah, say construction on both roads will start in 2020-21, with the roads opening in 2025-26.',{"entities": [(207, 214, "START")]}),
    ('The first stage of construction on the new airport began on 24 September 2018, and the first stage is expected to be complete and open by December 2026.',{"entities": [(60, 77, "START")]}),
    ('Major construction on the four-kilometre Warringah Freeway upgrades will begin in late 2021, while some work will begin as early as March.',{"entities": [(82, 91, "START")]}),
    ('Major construction is expected to begin in October',{"entities": [(43, 50, "START")]}),
    ('The contract will start and end in 2022, with plant production expected late in the final quarter of the year.',{"entities": [(35, 39, "START")]}),
    ('Construction for the transmission line is expected to commence in April, 2022, for completion by early 2024.',{"entities": [(66, 77, "START")]}),
    ('Early works have begun on an $80 million rail infrastructure project south of Perth for which the head contractor is locked in and main works will begin in October.',{"entities": [(156, 163, "START")]}),
    ('Following site mobilisation in the second half of 2021, major construction works on both the Botany Rail Duplication and Cabramatta Loop are scheduled to commence in the first quarter of 2022.',{"entities": [(170, 191, "START")]}),
    ('The Federal Department of Infrastructure, Transport, Regional Development and Communications said construction was expected to commence in late 2022 and be completed by late 2025.',{"entities": [(139, 148, "START")]}),
    ('With the combined funds received from the placement and SPP, the company now has the financial capacity to meet its share of final feasibility study costs and commitment to early lead time items under the joint venture with Minotaur and allow construction and mining to commence at the project in 2022 once a final decision to proceed is made by the joint venture partners.',{"entities": [(297, 301, "START")]}),
    ('Early works on the project is anticipated to commence in 2022 with construction expected to take 18 months to complete. ',{"entities": [(57, 61, "START")]}),
    ('The proponents Bowen Pipeline Company(BPC)are pushing for co-ordinated project status and aiming for a mid-2023 start to construction for the Bowen Pipeline Project.',{"entities": [(103, 111, "START")]}),
    ('Expected commencement of construction is December, 2021.',{"entities": [(41, 55, "START")]}),
    ('NRW said design and procurement would commence immediately and be followed by commencement of construction in August 2021.',{"entities": [(110, 121, "START")]}),
    ('Subject to approvals construction is expected to start in 2021 with full operation in 2022.',{"entities": [(58, 62, "START")]}),
    ('Construction for early works is scheduled to commence next year, subject to all relevant approvals.',{"entities": [(54, 63, "START")]}),
    ('“It is anticipated that construction of stage 1 would begin in 2024 with the first operations to commence in 2025,” Spark Infrastructure said.',{"entities": [(63, 67, "START")]}),
    ('Early construction activities are set to commence in late 2021 and major construction in 2022. ',{"entities": [(53, 62, "START")]}),
    ('“Core is excited to continue accomplishing crucial milestones for the project as we prepare to commence construction before the end of 2021, and deliver first production in 2022,” added Biggins. ',{"entities": [(128, 139, "START")]}),
    ('Weather and construction conditions permitting, early works will begin in early 2022 and completed by 2023 while the entire Stage 1 of the B2N project is expected to be completed by 2025. ',{"entities": [(74, 84, "START")]}),    
    ('“With construction expected to begin in the first half of 2022, we are keen to start the conversation with local businesses on ways they could support the project. ',{"entities": [(44, 62, "START")]}),    
    ('Subject to relevant State and Federal Government approvals, initial works on this project is anticipated to commence in 2023. ',{"entities": [(120, 124, "START")]}),
    ('“We anticipate early works to commence later this year, with major construction kicking off in early 2022 ',{"entities": [(95, 105, "START")]}),
    ('Both companies are expected to mobilise on site in the second half of 2021 and major construction on the Port Botany Rail Line Duplication and Cabramatta Loop projects are anticipated to commence in the first quarter of 2022.  ',{"entities": [(203, 224, "START")]}),
    ('Early works on this development commenced in December 2020. Major construction is due to begin in the coming months and the tunnelling contract to be awarded by the end of this year, with the massive tunnel boring machines to begin digging by the end of 2023.',{"entities": [(95, 115, "START")]}),
    ('Construction on an $11 million project that will replace the existing Rocky Creek Bridge that was built in 1928 is set to commence in August of this year. ',{"entities": [(134, 140, "START")]}),
    ('Transurban has not given a new completion date for the West Gate Tunnel, which was originally scheduled to be open by the end of 2022.',{"entities": [(129, 133, "END")]}),
    ('The state government predicts 36,000 people will transfer between trains and buses at Roma Street station daily by the time the project is completed in 2024.',{"entities": [(152, 156, "END")]}),
    ('Construction on the twin tunnels will begin early next year, and be completed by 2025, a year later than the government first intended.',{"entities": [(81, 85, "END")]}),
    ('Project updates for the so-called Western Harbour Tunnel between Rozelle and North Sydney, and Beaches Link motorway between Cammeray and Seaforth and Balgowlah, say construction on both roads will start in 2020-21, with the roads opening in 2025-26.',{"entities": [(242, 249, "END")]}),
    ('The first stage of construction on the new airport began on 24 September 2018, and the first stage is expected to be complete and open by December 2026.',{"entities": [(138, 151, "END")]}),
    ('The contract will start and end in 2022, with plant production expected late in the final quarter of the year.',{"entities": [(35, 39, "END")]}),
    ('Premier Mark McGowan says the battery would likely be the second biggest in Australia once complete in 2022z, although it would largely depend on the status of other big battery projects elsewhere, including the mooted replacement for the Liddell coal generator in NSW.',{"entities": [(103, 107, "END")]}),
    ('The project completion is expected in early 2022',{"entities": [(38, 48, "END")]}),
    ('Construction for the transmission line is expected to commence in April, 2022, for completion by early 2024.',{"entities": [(97, 107, "END")]}),
    ('The Federal Department of Infrastructure, Transport, Regional Development and Communications said construction was expected to commence in late 2022 and be completed by late 2025.',{"entities": [(169, 178, "END")]}),
    ('EDL said the microgrid’s diesel generators were scheduled to come online by the end of 2021 and the solar farm in early 2022, with the entire hybrid power station expected to be fully commissioned by February 2022.',{"entities": [(200, 213, "END")]}),
    ('Stage Two of the Shepparton Line Upgrade is targeted for completion in late 2022.',{"entities": [(71, 80, "END")]}),
    ('Construction and commissioning is scheduled for completion in early 2022.',{"entities": [(62, 72, "END")]}),
    ('The Forrestfield-Airport Link is scheduled to open in the second half of 2021, and provide a 20 minute direct link between the eastern foothills, central business district, wider public transport network and airport.',{"entities": [(58, 77, "END")]}),
    ('Subject to approvals construction is expected to start in 2021 with full operation in 2022.',{"entities": [(86, 90, "END")]}),
    ('Magnetite CEO Peter Schubert said with a small-scale start-up, the company is planning to ship its first iron ore product from the Razorback project at the end of 2024. ',{"entities": [(156, 167, "END")]}),
    ('Project director Frank Fisseler was unavailable for comment, but Projectory understands that the WEP Project team are planning for the majority of the work to be completed by 2023.',{"entities": [(175, 179, "END")]}),
    ('“It is anticipated that construction of stage 1 would begin in 2024 with the first operations to commence in 2025,” Spark Infrastructure said.',{"entities": [(109, 113, "END")]}),
    ('The new Pimpama station is due to be completed in time for the opening of CRR in 2025. ',{"entities": [(81, 85, "END")]}),
    ('The project is expected to create 100 jobs during a projected two-year construction phase and commissioning period, with completion expected in September, 2022.',{"entities": [(144, 159, "END")]}),
    ('Weather and construction conditions permitting, early works will begin in early 2022 and completed by 2023 while the entire Stage 1 of the B2N project is expected to be completed by 2025. ',{"entities": [(182, 186, "END")]}),
    ('Weather permitting, the Genex Kidston Connection Project is due to be completed in early 2024. ',{"entities": [(83, 93, "END")]}),
    ('The new Goornong Station is projected to be completed by the end of 2021',{"entities": [(61, 72, "END")]}),
    ('while the new Huntly Station to be completed in mid-2022.',{"entities": [(48, 56, "END")]}),
    ('The Raywood Station on the Swan Hill Line is due to be completed at the end of 2022. ',{"entities": [(72, 83, "END")]}),
    ('The new ETO will allow for additional weekday return services to both Epsom and Eaglehawk as well as triple the number of weekday train services to and from Echuca once the project is completed in late 2022. ',{"entities": [(197, 206, "END")]}),
    ('Weather and construction conditions permitting, this Bruce Highway upgrade is expected to be completed by late 2023. ',{"entities": [(106, 115, "END")]}),
    ('SIMPEC will commence works immediately, with project completion expected by mid-2022. ',{"entities": [(76, 84, "END")]}),
    ('Works are undertaken by WBHO Infrastructure Pty Ltd and construction is due to be completed in the first half of 2022. ',{"entities": [(99, 117, "END")]}),
    ('Weather permitting, works on the car park upgrade is due to be completed by late 2022, with the entire Ferny Grove project completed by late 2023. ',{"entities": [(136, 145, "END")]}),
    ('Contractor Fulton Hogan was chosen to build the multi-million dollar bypass after a competitive tender process that shortlisted three companies, which includes CPB Contractors and a joint venture between BMD Constructions and John Holland.',{"entities": [(11, 23, "CONTRACTOR")]}),
    ('Northern Star Resources has appointed GR Engineering Services as its engineering, procurement and construction contractor at the Thunderbox gold operations in Western Australia.',{"entities": [(38, 61, "CONTRACTOR")]}),
    ('The Australian Rail Track Corporation (ARTC) has awarded John Holland the design and construction contract for the Botany Rail Duplication Project.',{"entities": [(57, 69, "CONTRACTOR")]}),
    ('Energy Developments Limited has secured the contract to build and operate the Jabiru hybrid power station.',{"entities": [(0, 27, "CONTRACTOR")]}),
    ('Juwi Renewable Energy in turn has been appointed to construct the 3.9 MW solar farm. ',{"entities": [(0, 21, "CONTRACTOR")]}),
    ('Coleman Rail in partnership with KBR has been selected to deliver Stage Two of the Shepparton Line Upgrade.',{"entities": [(0, 12, "CONTRACTOR")]}),
    ('Rio Tinto has awarded NRW a contract for the delivery of a 34MW Solar PV System at the Gudai Darri mine.',{"entities": [(22, 25, "CONTRACTOR")]}),
    ('The Pacific Industrial Company recently secured the steel fabrication package for the Public Transport Authority (PTA) of Western Australia’s METRONET Airport Central Station Project at Perth Airport.',{"entities": [(4, 30, "CONTRACTOR")]}),
    ('Barminco signed a Letter of Intent (LOI) with Panoramic Resources Limited for a $280 million four-year contract, now finalised, to begin works at the Savannah Nickel Project in WA.',{"entities": [(0, 8, "CONTRACTOR")]}),
    ('Core Lithium has signed a two-year deal with the NT Power and Water Corporation (PWC) to connect its Finniss project to the power grid.',{"entities": [(0, 12, "CONTRACTOR")]}),
    ('Works are undertaken by WBHO Infrastructure Pty Ltd and construction is due to be completed in the first half of 2022. ',{"entities": [(24, 51, "CONTRACTOR")]}),
    ('Major construction has commenced on the $140 million Transit Oriented Development (TOD) at the Ferny Grove Train Station in Queensland, with works to be delivered by Queensland-based developer Honeycombes Property Group. ',{"entities": [(193, 219, "CONTRACTOR")]}),
    ('Construction of Ferny Grove Central is being undertaken by Broad Constructions – a wholly owned subsidiary of CIMIC Group company CPB Contractors',{"entities": [(59, 78, "CONTRACTOR")]})
    
    
]

In [41]:
# Adding all labels together
ner.add_label(LABEL)
ner.add_label(LABEL1)
ner.add_label(LABEL2)
ner.add_label(LABEL3)
ner.add_label(LABEL4)


# Resume training
optimizer = nlp.resume_training()
move_names = list(ner.move_names)

# List of pipes you want to train
pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]

# List of pipes which should remain unaffected in training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]

In [42]:
for _, annotations in TRAIN_DATA:
    for ent in annotations.get("entities"):
        ner.add_label(ent[2])

In [43]:
from spacy.util import minibatch, compounding
import random

# Begin training by disabling other pipeline components
with nlp.disable_pipes(*other_pipes) :

    sizes = compounding(1.0, 4.0, 1.001)
  # Training for 30 iterations     
    for itn in range(30):
    # shuffle examples before training
        random.shuffle(TRAIN_DATA)
        # batch up the examples using spaCy's minibatch
        batches = minibatch(TRAIN_DATA, size=sizes)
        # ictionary to store losses
        losses = {}
        for batch in spacy.util.minibatch(TRAIN_DATA, size=2):
            for text, annotations in batch:
                doc = nlp.make_doc(text)
                example = Example.from_dict(doc, annotations)
                nlp.update([example], losses=losses, drop=0.3)
                print("Losses", losses)
              # Calling update() over the iteration
            

Losses {'ner': 1.9658393460505437}
Losses {'ner': 9.025036982624925}
Losses {'ner': 16.938967378727263}
Losses {'ner': 20.826988421151707}
Losses {'ner': 31.57700361001206}
Losses {'ner': 35.01051160609266}
Losses {'ner': 39.8202700553872}
Losses {'ner': 41.32275131652302}
Losses {'ner': 44.0520579437212}
Losses {'ner': 45.67911585328409}
Losses {'ner': 51.086338826126266}
Losses {'ner': 52.7821191983505}
Losses {'ner': 55.885875440075104}
Losses {'ner': 57.57853593364056}
Losses {'ner': 59.392111674299}
Losses {'ner': 62.27590273837025}
Losses {'ner': 64.43441392738492}
Losses {'ner': 66.46950190680617}
Losses {'ner': 68.01518074117433}
Losses {'ner': 70.08999598046029}
Losses {'ner': 72.9628106809157}
Losses {'ner': 78.15869902266623}
Losses {'ner': 80.875355359588}
Losses {'ner': 82.79385030902077}
Losses {'ner': 86.25165023566414}
Losses {'ner': 88.00085503379036}
Losses {'ner': 89.96330703241591}
Losses {'ner': 93.3092435437049}
Losses {'ner': 95.30678462638465}
Losses {'ner': 97.



Losses {'ner': 328.3355334150797}
Losses {'ner': 328.7079484066696}
Losses {'ner': 330.5446623783202}
Losses {'ner': 332.05915163195016}
Losses {'ner': 332.34267240739837}
Losses {'ner': 334.30975001409973}
Losses {'ner': 335.3912065939713}
Losses {'ner': 337.54238208264496}
Losses {'ner': 339.9274708607535}
Losses {'ner': 341.6528903686449}
Losses {'ner': 343.0722863044921}
Losses {'ner': 344.70720883744673}
Losses {'ner': 346.7231499626235}
Losses {'ner': 348.40370573063257}
Losses {'ner': 350.1644392351096}
Losses {'ner': 352.7734081486709}
Losses {'ner': 354.73201776316205}
Losses {'ner': 355.87482399636633}
Losses {'ner': 357.90981970283775}
Losses {'ner': 23.153377696220417}
Losses {'ner': 24.377192445026072}
Losses {'ner': 26.594010322742086}
Losses {'ner': 27.16968246930238}
Losses {'ner': 33.23725855153783}
Losses {'ner': 35.262801419100114}
Losses {'ner': 35.97563560540523}
Losses {'ner': 37.90207892616703}
Losses {'ner': 39.54135384325515}
Losses {'ner': 42.65430988669935}
L

Since we don't have separate labels now, I will define a function called "pred", which will output all predictions/tags that the model can produce.

In [44]:
def pred(text):
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]

In [45]:
i=0
list_alltags=[]
while i<len(roadsonline):
    list_alltags.append(pred(str(roadsonline['content'][i])))
    i+=1

In [46]:
roadsonline['tags']=list_alltags

Now, we will apply the same technique for the projectory dataset content column. However, I will be using only the specification 2 on this dataset to save time. Note that before starting each model, I import the spacy builtin tool all over again to start a completely new model. But because now I am using specification 2 model on the second dataset, I do not need to train the model again. Instead I can just use the "pred" function on the items of the content column. This applies to any dataset. After this line, simply apply the "pred" function to any text column (you need to iterate it over all rows), to get the predictions of the specification 2 model.

In [47]:
# Make sure to change the list name each time to avoid confusion
i=0
list_alltags_new=[]
while i<len(projectory):
    list_alltags_new.append(pred(str(projectory['content'][i])))
    i+=1
# Add predicitons to the dataset
projectory['tags']=list_alltags_new

The exact code above applies to any dataset, by just changing the list name, dataset name and the column name.