# Training Spacy for Entity Recognition
"training to extract tech skills and entity names"

- <a href="https://medium.com/swlh/build-a-custom-named-entity-recognition-model-ussing-spacy-950bd4c6449f">Medium :: Build A Custom Named Entity Recog Model with Spacy</a>
- <a href="https://www.youtube.com/watch?v=IqOJU1-_Fi0">Youtube :: Intro to NLP with Spacy IV :: Named Entity Recog</a>
- <a href="https://spacy.io/usage/training">Spacy Docs :: Training</a>

- <a href="https://www.kdnuggets.com/2018/08/named-entity-recognition-practitioners-guide-nlp-4.html"> KDNuggets :: Named Entity Recognition Practitioner's Guide IV</a>
- <a href="https://towardsdatascience.com/named-entity-recognition-with-nltk-and-spacy-8c4a7d88e7da"> Toward DS :: Named Entity Recog w/ NLTK and Spacy</a>
- <a href="https://towardsdatascience.com/custom-named-entity-recognition-using-spacy-7140ebbb3718">Toward DS :: Custom Named Entity Recogn Using Spacy</a>

In [None]:
import random
import warnings
import spacy

from spacy.util import minibatch, compounding
from spacy.lang.en import English
from pathlib import Path

nlp = spacy.load("en_core_web_sm")

`nlp` is a processing pipeline: after tokenizing the text, it runs three components:

In [None]:
nlp.pipeline

[('tagger', <spacy.pipeline.pipes.Tagger at 0x7f0964a56d68>),
 ('parser', <spacy.pipeline.pipes.DependencyParser at 0x7f0964680888>),
 ('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x7f09646808e8>)]

It reads text and spits out entities:

In [None]:
nlp("This LinkedIn project is a nightmare and I hope Professor Jurgens doesn't fail us").ents

(LinkedIn, Jurgens)

We can also render a highlighted tagging:

In [None]:
from spacy import displacy
displacy.render(nlp("This LinkedIn project is a nightmare and I hope Professor Jurgens doesn't fail us"), jupyter=True, style='ent')

***
We can also train a custom entity recognizer:
- The training data is a list of tuples: the first item of each tuple is the text string, the second is an annotations dictionary
- the dictionary's `"entities"` key is the answer key the recognizer gets trained on
- its value is a list of tuples, one per entity, each holding `starting_char_index`, `ending_char_index`, and `label`
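
To make the format concrete, here is a minimal sketch with one made-up sentence. The sentence, the `EXAMPLE_TRAIN_DATA` name, and the `labeled_phrases` helper are illustrative assumptions, not part of the data sets below. Slicing the text with each `(start, end)` pair should give back exactly the labeled phrase; the end index is exclusive, like a Python slice.

```python
# Hypothetical one-item training set in the (text, {"entities": [...]}) format.
EXAMPLE_TRAIN_DATA = [
    (
        "Experience with Python and machine learning",
        {"entities": [(16, 22, "ProgLang"), (27, 43, "Skill")]},
    ),
]


def labeled_phrases(example):
    """Return (phrase, label) pairs by slicing the text with each span."""
    text, annotations = example
    return [(text[start:end], label) for start, end, label in annotations["entities"]]


for example in EXAMPLE_TRAIN_DATA:
    # Each phrase should read exactly like the thing we meant to label.
    print(labeled_phrases(example))
```

If a slice comes back with a stray space or a chopped-off letter, the offsets are off by one and should be fixed before training.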

FLATTEN entries: use https://www.textfixer.com/tools/remove-line-breaks.php to remove both line breaks and paragraph breaks.
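
The same flattening can probably be done in Python instead of a web tool. This is a minimal sketch; `flatten` is a hypothetical helper name, and it collapses every run of whitespace (not only line breaks) into a single space, which may be more aggressive than the web tool.

```python
def flatten(text):
    """Collapse every run of whitespace, including line and paragraph breaks, into one space."""
    return " ".join(text.split())


# Made-up snippet with a paragraph break, doubled spaces, and a Windows line ending.
raw = "Experience with Python\n\nKnowledge of   SQL\r\nand Spark"
print(flatten(raw))
```

Flattening changes character positions, so it has to happen before the entity offsets are worked out.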

In [None]:
# x = "Degree in Machine Learning, Computer Science, Electrical Engineering, Physics, Statistics, Applied Math or other quantitative fields 0-2 years of working experience in deep learning and machine learning Ability to independently support existing products Proven track record in modifying and applying advanced algorithms to address practical problems Experience with Natural Language Processing, Natural Language Understanding, and the relevant open-source tools Proficient with advanced deep learning technology and open source tools such as pytorch or tensorflow Experience with regression, Neural Network, SVM, and/or ensemble methods Experience with generative modeling techniques such as GAN Experience in developing, modifying and experimenting advanced language models Experience in developing/applying/evaluating conversational AI technology Proven ability to work independently on development of complex models with extremely large and complex data structures Proficient in more than one of Python, R, Java, C++, or C Robust knowledge and experience with statistical methods Extensive knowledge of SQL Experience with Hadoop and NoSQL related technologies such as Map Reduce, Spark, Hive, HBase, mongoDB, Cassandra, etc. Experience with online, mobile marketing analytics Experience with GPU programming Solid knowledge of Bayesian statistical inference and related machine learning methods. Experience with Agile methods for software development"
# x


In [None]:
# targets = ["Machine Learning"]

# for each in targets:
#     L = x.find(each)
#     R = x.find(each)+len(each)
#     print(L, R, x[L:R])
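
The commented-out loop above only finds the first occurrence of each target. A slightly more general sketch, assuming exact case-sensitive matching is good enough, collects every occurrence as a ready-to-paste `(start, end, label)` tuple. `find_spans` and the sample sentence are hypothetical, not something the notebook defines.

```python
import re


def find_spans(text, targets, label):
    """Return (start, end, label) for every exact, case-sensitive occurrence of each target."""
    spans = []
    for target in targets:
        # re.escape keeps characters like "+" in "C++" from being read as regex syntax.
        for match in re.finditer(re.escape(target), text):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)


text = "Experience with SQL and NoSQL, plus SQL tuning"
for start, end, label in find_spans(text, ["SQL"], "ProgLang"):
    print(start, end, label, text[start:end])
```

Note that plain substring search also hits the `SQL` inside `NoSQL`, so the results still need a quick look before they go into the training data.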

In [None]:
### further testing cell

# x="Part of the Domo promise is to give customers leverage “in record time.” This is also how our product team operates; For candidates who want to excel in our fast-paced environment we expect: Independent, self-starters who are motivated by challenges, driven by natural curiosity, and interested in solving big problems; Experience designing digital products, on the web, mobile, and even native applications (preference to experience designing enterprise software); Comfortable with the ‘full-stack’ of design skills, from needs finding, wireframing, designing, writing, prototyping, and presenting experiences; Effective at understanding and breaking down complex ideas into simple ones, and combining them into novel solutions; Collaborates well with both designers and stakeholders, and communicates effectively when presenting and in writing; Expertly navigates disagreements between usability and feasibility, and resolves roadblocks without alienating stakeholders; Familiarity with designing and using design systems in tools like Sketch, Figma, and Adobe XD; Learns quickly by jumping into the deep end, is unafraid of asking questions, and doesn’t quit before mastering both design skill and product know-how."
# x[331:340]

## Nico's Train Data

In [None]:
NICO_TRAIN_DATA = [
    ("Experience performing internal and external assessments Experience in leading a team during penetration tests Knowledge of server (Linux, Windows) and client (Windows, OS X, Linux) operating systems Knowledge and understanding of attack surfaces for enterprise systems and services Experience in at least one of PHP/Hack, Python, C/C++, Go or Java Experience working in cross-functional programs Experience translating technical concepts into language that is understood to audiences including software engineers, business and technical leaders 5+ years of experience practicing application security assessments and penetration tests Experience performing and leading whitebox and blackbox style assessments Experience with complex, multi-stage, multi-person pentests for new internal customers or external vendors Networking knowledge, including network virtualization technologies and ideally IPv6",
        {"entities": [(70, 84, "Skill"), (92, 109, "Skill"), (131, 136, "ProgLang"), (138, 145, "ProgLang"), (230, 245, "Skill"), 
                      (312, 315, "ProgLang"), (316, 320, "ProgLang"), (322, 328, "ProgLang"), (330, 335, "ProgLang"), (337, 339, "ProgLang"),
                      (343, 347, "ProgLang"), (591, 610, "Skill"), (815, 825, "Skill")]}),
                   
    ("Masters or Ph.D. with a specialization in deep learning, machine learning, artificial intelligence, computer vision, statistics, applied math, algorithm design, or a related quantitative field. Proficiency in statistics, machine learning, and model analytics. A minimum of 2 years’ hands-on applied or research experience developing machine learning models on large scale data sets. Able to translate academic research into application. Intermediate knowledge in computer architecture/compilers/OOP/data structure and algorithms. Basic understanding of natural language processing Advanced knowledge of SQL analysis. Knowledge of GIT/version control, SQL DB architecture, and Hadoop/Map Reduce/Spark Proficiency in one or more of the higher-level programming languages like Python, Java, C++, Scala, R, etc. Strong problem-solving, written and spoken communication skills. Strong understanding of natural language processing. Experience with Blizzard Entertainment games. A proven track record of original contributions to machine learning or statistics. Experience with data visualization tools like Tableau. Passion for video games",
        {"entities": [(0, 7, "Education"), (11, 16, "Education"), (42, 55, "Skill"), (57, 73, "Skill"), (75, 98, "Skill"), (100, 115, "Skill"), 
                      (117, 127, "Skill"), (243, 258, "Skill"), (302, 310, "Skill"), (485, 494, "Skill"), (495, 498, "ProgLang"), (499, 513, "Skill"),
                      (553, 580, "Skill"), (603, 606, "ProgLang"), (630, 633, "ProgLang"), (634, 649, "Skill"), (676, 682, "ProgLang"), (683, 693, "ProgLang"),
                      (694, 699, "ProgLang"), (774, 780, "ProgLang"), (782, 786, "ProgLang"), (788, 791, "ProgLang"), (793, 798, "ProgLang"),
                      (851, 871, "Skill"), (815, 830, "Skill"), (1071, 1089, "Skill"), (1101, 1108, "Skill")]}),
                   
    ("Degree in Machine Learning, Computer Science, Electrical Engineering, Physics, Statistics, Applied Math or other quantitative fields 0-2 years of working experience in deep learning and machine learning Ability to independently support existing products Proven track record in modifying and applying advanced algorithms to address practical problems Experience with Natural Language Processing, Natural Language Understanding, and the relevant open-source tools Proficient with advanced deep learning technology and open source tools such as pytorch or tensorflow Experience with regression, Neural Network, SVM, and/or ensemble methods Experience with generative modeling techniques such as GAN Experience in developing, modifying and experimenting advanced language models Experience in developing/applying/evaluating conversational AI technology Proven ability to work independently on development of complex models with extremely large and complex data structures Proficient in more than one of Python, R, Java, C++, or C Robust knowledge and experience with statistical methods Extensive knowledge of SQL Experience with Hadoop and NoSQL related technologies such as Map Reduce, Spark, Hive, HBase, mongoDB, Cassandra, etc. Experience with online, mobile marketing analytics Experience with GPU programming Solid knowledge of Bayesian statistical inference and related machine learning methods. Experience with Agile methods for software development",
        {"entities": [(10, 26, "Skill"), (28, 44, "Skill"), (46, 68, "Skill"), (70, 77, "Skill"), (79, 89, "Skill"), (91, 103, "Skill"), (168, 181, "Skill"),
                      (186, 202, "Skill"), (291, 319, "Skill"), (366, 393, "Skill"), (395, 425, "Skill"), (542, 549, "ProgLang"), (553, 563, "ProgLang"), 
                      (580, 590, "Skill"), (592, 606, "Skill"), (608, 611, "Skill"), (620, 626, "Skill"), (653, 672, "Skill"), (692, 695, "ProgLang"), (759, 774, "Skill"),
                      (820, 837, "Skill"), (999, 1005, "ProgLang"), (1007, 1008, "ProgLang"), (1010, 1014, "ProgLang"), (1016, 1019, "ProgLang"), (1106, 1109, "ProgLang"), 
                      (1126, 1132, "ProgLang"), (1137, 1142, "ProgLang"), (1172, 1182, "ProgLang"), (1184, 1189, "ProgLang"), (1191, 1195, "ProgLang"), (1197, 1202, "ProgLang"),
                      (1204, 1211, "ProgLang"), (1213, 1222, "ProgLang"), (1260, 1279, "Skill"), (1296, 1311, "Skill"), (1331, 1361, "Skill"), (1416, 1429, "Skill"),
                      (1434, 1454, "Skill")]}),
                   
    ("Mastery of Python. Ability to design, implement, and document scalable code. Deep understanding and intuition around machine learning techniques, statistical, and predictive modelling. Bachelor of Science Degree in related discipline. Willingness and ability to travel OCONUS (up to 25%) for customer engagements. Driven, self-directed personality. Strong sense of mission and commitment to making a difference. Eligibility and willingness to obtain a US Security Clearance. Active TS/SCI Security Clearance. Experience with RF, 802.11, or Bluetooth emission and transmission protocols. Knowledge of data pipelines. Experience with direct application to computer vision, particularly object recognition and interaction tagging. Experience with embedded / mobile systems design. Advanced (MS or PhD) degrees in mathematics, physics, statistics, machine learning, or related quantitative disciplines.",
        {"entities": [(11, 17, "ProgLang"), (62, 75, "Skill"), (117, 133, "Skill"), (163, 183, "Skill"), (185, 204, "Education"), 
                      (455, 473, "Skill"), (600, 614, "Skill"), (654, 669, "Skill"), (684, 702, "Skill"), (707, 726, "Skill"), 
                      (788, 790, "Education"), (794, 797, "Education")]}),
                   
    ("PhD or equivalent Master's Degree plus 4+ years of experience in a quantitative field. Strong analytical skills. 2+ years of experience of building predictive models for business and proficiency in model development and model validation. Experience in efficiently handling large data sets, e.g., by using SQL, and databases in a business environment. Experience with R, Python, Matlab or other scripting languages. Experience with time series modeling and machine learning forecasting. Experience with price modeling.",
        {"entities": [(0, 3, "Education"), (18, 26, "Education"), (94, 111, "Skill"), (198, 215, "Skill"), (220, 236, "Skill"), (264, 288, "Skill"),
                      (305, 308, "ProgLang"), (367, 368, "ProgLang"), (370, 376, "ProgLang"), (378, 384, "ProgLang"), (394, 413, "Skill"), (431, 451, "Skill"),
                      (456, 472, "Skill"), (473, 484, "Skill"), (502, 516, "Skill")]}),

    ("Good knowledge of the most common Python libraries for data mining Experience in scraping data (eg: using selenium), parsing and cleaning data using text mining tools Experience in data visualization and analysis Experience with Tensorflow 2.0 Experience with graph databases such as Neo4J Experience with Spark is a plus Basic knowledge of SQL",
        {"entities": [(34, 40, "ProgLang"), (55, 66, "Skill"), (81, 84, "Skill"), (106, 114, "ProgLang"), (129, 142, "Skill"), 
                      (149, 160, "Skill"), (181, 199, "Skill"), (229, 239, "ProgLang"), (284, 289, "ProgLang"), (306, 311, "ProgLang"), 
                      (341, 344, "ProgLang")]}),

    ("2+ years of experience doing quantitative analysis at a consumer-facing technology company Experience with data pipelining, from data preparation to analysis to deployment Experience writing and optimizing complex SQL queries on large data sets Knowledge of inferential statistics, especially in an experimentation setting (hypothesis testing, power analysis, experimental design) Experience communicating the results of analyses with product and leadership teams to influence the strategy of the product Experience manipulating large data sets through SQL or Python Development experience in Python Experience developing in Looker Natural Language processing (NLP) with large data sets Experience in big data technologies like Spark",
        {"entities": [(29, 50, "Skill"), (107, 122, "Skill"), (129, 145, "Skill"), (161, 171, "Skill"), (214, 217, "ProgLang"), 
                      (258, 280, "Skill"), (324, 342, "Skill"), (344, 358, "Skill"), (360, 379, "Skill"), (516, 544, "Skill"), 
                      (625, 631, "ProgLang"), (661, 664, "Skill"), (701, 709, "Skill"), (728, 733, "ProgLang")]}),

    ("Requirements Bachelor’s required Data analysis experience, comfortable with data validation, processing, and maintenance tasks. Proven success record with leading and/or participating in high-level, complicated projects. Ability to analyze data to glean insights that can be used in setting strategy, decision making and validation Ability to work collaboratively on high-performance teams and effectively support cross-functional teams in a matrix environment. Ability to manage multiple projects and priorities with tight deadlines and limited supervision. Attention to detail. Excellent interpersonal and customer service skills with strong written, verbal, and listening skills, including proficiency in public speaking. Strong critical thinking skills and ability to perform critical analysis with limited oversight. Advanced proficiency in Microsoft Office Suite, specifically Excel, Word and PowerPoint; experience with Tableau. Strong skills in oral and written communication including experience in effectively communicating with executives, employees and colleagues.",
        {"entities": [(13, 23, "Education"), (32, 46, "Skill"), (76, 91, "Skill"), (155, 162, "Skill"), (301, 316, "Skill"), 
                      (343, 363, "Skill"), (473, 479, "Skill"), (559, 578, "Skill"), (708, 723, "Skill"), (732, 749, "Skill"), 
                      (846, 862, "ProgLang"), (927, 934, "ProgLang"), (970, 983, "Skill")]}),

    ("3+ years of overall IT experience including with Linux or Unix based OS 2+ Strong experience with a programming language such as Java, C, C++, Python, Go and experience scripting (i.e., Bash) 2+ years of experience with CI/CD components, such as Terraform, Ansible, Chef, and Jenkins. Experience with observability, monitoring, alerting, and dashboards, using tooling such as Grafana, Kibana, Prometheus, PagerDuty, and Sysdig Strong debug skills, effective verbal and written communication skills, team oriented",
        {"entities": [(49, 54, "ProgLang"), (58, 62, "ProgLang"), (129, 133, "ProgLang"), (135, 136, "ProgLang"), (138, 141, "ProgLang"), 
                      (143, 149, "ProgLang"), (151, 153, "ProgLang"), (169, 178, "Skill"), (186, 190, "ProgLang"), (220, 225, "ProgLang"), 
                      (246, 255, "ProgLang"), (266, 270, "ProgLang"), (276, 283, "ProgLang"), (376, 383, "ProgLang"), (385, 391, "ProgLang"), 
                      (393, 403, "ProgLang"), (405, 414, "ProgLang"), (434, 439, "Skill"), (499, 512, "Skill")]}),

    ("Experience with cloud technologies and computing platforms/infrastructure such as AWS, Azure, IBM Cloud, and SoftLayer Experience with REST APIs, using Kubernetes to build and run applications. Experience with Hardware Security Modules (HSMs) and working knowledge of managing digital keys, performing encryption and decryption functions for digital signatures, use of strong authentication and other cryptographic functions. Experience with network/application protocols, load balancers and proxy servers. Familiar with security concepts and technology in the cloud and software development Bachelor’s Degree in appropriate field of study or equivalent work experience",
        {"entities": [(82, 85, "ProgLang"), (87, 92, "ProgLang"), (94, 103, "ProgLang"), (135, 144, "Skill"), (152, 162, "ProgLang"), 
                      (442, 471, "Skill"), (571, 591, "Skill")]}),

    ("Bachelor’s or Master’s degree in Computer Science or related field, or equivalent in experience 3+ years of experience in building software products for external customer use in Java, C, C++, C# or some other object-oriented language with evidence of exceptional ability Experience with writing SQL-based applications such as MySQL or MS SQL or Oracle Experience working with Linux or similar O/Ss Experience building scalable applications (Web applications or back-end services) Detail-oriented, can identify and fix own bugs, and write quality code that runs efficiently Passionate, positive, can-do attitude and can adapt to any challenge and willing to take ownership of problems and brings issues to full resolution Enjoy working in a team that follows agile practices and code reviews and CI/CD",
        {"entities": [(0, 10, "Education"), (14, 22, "Education"), (33, 49, "Education"), (178, 182, "ProgLang"), (184, 185, "ProgLang"), 
                      (187, 190, "ProgLang"), (192, 194, "ProgLang"), (295, 298, "ProgLang"), (326, 331, "ProgLang"), (376, 381, "ProgLang"), 
                      (758, 763, "Skill"), (795, 800, "Skill")]}),

    ("You have 4+ years of experience designing features for web-based applications from concept, and have a portfolio of web design samples to show your work. You have strong user empathy. You see yourself as a user advocate and want to deeply understand your users' needs. You're hungry for data on user behavior and love talking to users, hearing their feedback, and adjusting your approach. You have a good balance between confidence in your design choices and ability to seek others' ideas, added context, and feedback to iterate collaboratively. Your designs are well-reasoned and you're able to clearly communicate the whys behind your design choices as well the details of how interactions should work to product managers and developers. You have a strong attention to detail and drive towards consistency. You have some familiarity with HTML and CSS and interest in building components to improve both the user and developer experience. You desire to work in a fast changing, dynamic startup environment. You eat ambiguous projects for breakfast and turn them into intuitive user experiences. You are a self-starter and finisher.",
        {"entities": [(116, 126, "Skill"), (170, 182, "Skill"), (758, 777, "Skill"), (840, 844, "ProgLang"), 
                      (849, 852, "ProgLang"), (1106, 1118, "Skill")]}),

    ("Familiarity and interest in SQL, Tableau, and BI tools (things our users leverage heavily). Excited about data visualizations and when and how to use data visualizations. Experience with HTML, CSS, Figma, Zeplin, Storybook, and/or component libraries. A background in SaaS, CDP, and/or web-based marketing tools. 2+ years experience working in a fast-paced environment. Experience working with engineering teams and related tools (e.g., git, docker).",
        {"entities": [(28, 31, "ProgLang"), (33, 40, "ProgLang"), (46, 54, "ProgLang"), (106, 125, "Skill"), (187, 191, "ProgLang"), 
                      (193, 196, "ProgLang"), (198, 203, "ProgLang"), (205, 211, "ProgLang"), (213, 222, "ProgLang"), (268, 272, "ProgLang"), 
                      (274, 277, "ProgLang"), (437, 440, "ProgLang"), (442, 448, "ProgLang")]}),

    ("Part of the Domo promise is to give customers leverage “in record time.” This is also how our product team operates; For candidates who want to excel in our fast-paced environment we expect: Independent, self-starters who are motivated by challenges, driven by natural curiosity, and interested in solving big problems; Experience designing digital products, on the web, mobile, and even native applications (preference to experience designing enterprise software); Comfortable with the ‘full-stack’ of design skills, from needs finding, wireframing, designing, writing, prototyping, and presenting experiences; Effective at understanding and breaking down complex ideas into simple ones, and combining them into novel solutions; Collaborates well with both designers and stakeholders, and communicates effectively when presenting and in writing; Expertly navigates disagreements between usability and feasibility, and resolves roadblocks without alienating stakeholders; Familiarity with designing and using design systems in tools like Sketch, Figma, and Adobe XD; Learns quickly by jumping into the deep end, is unafraid of asking questions, and doesn’t quit before mastering both design skill and product know-how.",
        {"entities": [(204, 216, "Skill"), (488, 498, "Skill"), (538, 549, "Skill"), (331, 340, "Skill"), 
                      (562, 569, "Skill"), (571, 582, "Skill"), (888, 897, "Skill"), (1038, 1044, "ProgLang"), (1046, 1051, "ProgLang"), 
                      (1057, 1065, "ProgLang")]}),

    ("Comfortable using Excel or Google Sheets, including writing formulas; Experience conducting research, analyzing data, and making data-driven decisions (either as a designer or in other related business roles); Passionate about data and data visualization (some familiarity with data languages like SQL, Python/R, or Javascript/D3 is a plus); Capable of presenting ideas and prototypes in animated format (familiarity with Adobe Premiere Pro and After Effects is a plus); Regularly troubleshoot your own software problems, people come to you as the ‘techie’ who knows how to find the solution or build a creative workaround.",
        {"entities": [(18, 23, "ProgLang"), (27, 40, "ProgLang"), (81, 100, "Skill"), (129, 150, "Skill"), (236, 254, "Skill"), 
                      (298, 301, "ProgLang"), (303, 309, "ProgLang"), (310, 311, "ProgLang"), (316, 326, "ProgLang"), (327, 329, "ProgLang"), 
                      (422, 440, "ProgLang"), (445, 458, "ProgLang"), (481, 493, "Skill")]}),

    ("Master’s or PhD Degree in Information Technology, Computer Science, or a related discipline strongly preferred Understanding of high performance ML algorithms (SVM’s, gradient boosted decision trees, deep neural networks, etc.) and with toolkits/frameworks from R or Python, preferably with cloud deployment experience Experience in building models that have been deployed to production (e.g. in marketing, demand forecasting, or supply chain) Experience using big data batch and streaming tools (Spark, AWS tools, Hadoop) Self-starter with high level of intellectual curiosity Strong analytical and hardworking problem solving mindset Able to build a sense of trust and rapport that creates a comfortable & effective workplace",
        {"entities": [(0, 8, "Education"), (12, 15, "Education"), (26, 48, "Education"), (50, 66, "Education"), (145, 147, "Skill"), 
                      (262, 263, "ProgLang"), (267, 273, "ProgLang"), (291, 307, "Skill"), (396, 405, "Skill"), (407, 425, "Skill"), 
                      (461, 469, "Skill"), (497, 502, "ProgLang"), (504, 507, "ProgLang"), (515, 521, "ProgLang"), (612, 635, "Skill")]}),

    ("Ingest and aggregate data from both internal and external data sources to build our extraordinary datasets Build large-scale batch and real-time data pipelines with data processing frameworks like Scio, Storm, or Spark on the Google Cloud Platform Demonstrate standard methodologies in continuous integration and delivery Help drive optimization, testing, and tooling to improve data quality Collaborate with other engineers, ML specialists, and partners, taking learning and leadership opportunities that will arise every day Work in cross functional agile teams to continuously experiment, iterate, and deliver on new product objectives. Work from our offices in New York, with some travel to other Spotify office locations",
        {"entities": [(197, 201, "ProgLang"), (203, 208, "ProgLang"), (213, 218, "ProgLang"), (226, 247, "ProgLang"), (286, 308, "Skill"), 
                      (392, 403, "Skill"), (426, 428, "Skill"), (552, 557, "Skill")]}),

    ("You know how to work with high volume heterogeneous data, preferably with distributed systems such as Hadoop, BigTable, and Cassandra You have experience with one or more higher-level JVM-based data processing frameworks such as Beam, Dataflow, Crunch, Scalding, Storm, Spark, or something we didn’t list- but not just Pig/Hive/BigQuery/other SQL-like abstractions You are knowledgeable about data modeling, data access, and data storage techniques You understand the value of teamwork within teams, are excellent communicators, and can build relationships with a diverse set of partners Machine Learning experience is a plus Experience with data ingestion via API and/or web scraping/crawling (e.g. Selenium, BeautifulSoup) at scale preferred Experience with Google Cloud Platform",
        {"entities": [(102, 108, "ProgLang"), (110, 118, "ProgLang"), (124, 133, "ProgLang"), (229, 233, "ProgLang"), (235, 243, "ProgLang"), 
                      (245, 251, "ProgLang"), (253, 261, "ProgLang"), (263, 268, "ProgLang"), (270, 275, "ProgLang"), (319, 322, "ProgLang"), (323, 327, "ProgLang"), (328, 336, "ProgLang"), (393, 406, "Skill"), (425, 437, "Skill"), (477, 485, "Skill"), (588, 604, "Skill"), (642, 656, "Skill"), (661, 664, "Skill"), (672, 684, "Skill"), (700, 708, "ProgLang"), (710, 723, "ProgLang"), (760, 781, "ProgLang")]}),

    ("Have the ability to work independently to plan, lead, and implement complex, accurate, and timely analysis Are a master with querying languages (e.g. SQL), and applying statistics to solve industry problems (i.e. A/B testing) You have experience communicating the results of analyses to influence strategy decisions You have an understanding of growth and marketing in the Fintech space You have worked in a startup or technology company to deliver a world class service to users",
        {"entities": [(20, 38, "Skill"), (150, 153, "ProgLang"), (169, 179, "Skill"), (213, 224, "Skill"), (297, 305, "Skill"), (345, 351, "Skill"), (356, 365, "Skill")]}),

    ("Bachelor's degree required 3+ years’ experience working in incident response, network investigations, tool development, and/or other IT related fields tied to information security Working knowledge of the following systems: Endpoint protection systems Database formats (SQL, SQLite, AGC, ODB, etc) Memory Analysis System logs from servers and network devices DHCP, AD, 802.1x, NAT, Web Proxy, and VPN logs - Passive DNS SIEM/Log Management systems (Splunk preferred) - Encase/Blacklight/Axiom/UFED or similar Scripting (Bash/Powershell/Python or similar) Experience investigating complex technical security incidents, highly sensitive employee matters, and insider threat assessment and management is required Independently leverage technical tools and techniques to conduct and support Security Intelligence investigations Working knowledge of object-oriented programming in order to customize open- source scripts and troubleshoot community tools Experience in analyzing complex data sets to detect patterns and anomalies Quickly learn and implement new technologies to further organizational goals Experience in conducting and overseeing complex, global, investigations is preferred Demonstrated knowledge of corporate investigation strategies utilizing technical forensic capabilities and data Demonstrated experience of regular communication at executive level within a global corporate environment Proven track record managing multiple complex projects simultaneously, and focusing on critical priorities with little or no supervision",
        {"entities": [(0, 10, "Education"), (59, 76, "Skill"), (102, 118, "Skill"), (159, 179, "Skill"), (270, 273, "ProgLang"), (275, 281, "ProgLang"), 
                      (298, 313, "Skill"), (520, 524, "ProgLang"), (525, 535, "ProgLang"), (536, 542, "ProgLang"), (665, 682, "Skill"), (920, 932, "Skill"), 
                      (994, 1009, "Skill"), (1222, 1246, "Skill"), (1267, 1288, "Skill"), (1333, 1346, "Skill"), (1424, 1432, "Skill")]})

]
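
Hand-typed offsets are easy to get wrong, so a quick sanity check before training can save some pain. The sketch below is an assumption about what is worth checking, not part of spaCy: it flags spans that are reversed or run past the end of the text, and spans that overlap (spaCy's entity recognizer expects non-overlapping entities). `check_spans` and `SAMPLE` are hypothetical names.

```python
def check_spans(train_data):
    """Return a list of problems: reversed or out-of-range spans, and overlapping spans."""
    problems = []
    for i, (text, annotations) in enumerate(train_data):
        spans = sorted(annotations["entities"])
        for start, end, label in spans:
            if not 0 <= start < end <= len(text):
                problems.append((i, "bad range", (start, end, label)))
        # After sorting by start, comparing neighbours reports at least one
        # overlap whenever any two well-formed spans overlap.
        for (s1, e1, l1), (s2, e2, l2) in zip(spans, spans[1:]):
            if s2 < e1:
                problems.append((i, "overlap", ((s1, e1, l1), (s2, e2, l2))))
    return problems


# Hypothetical sample: the second span is reversed and the last two overlap.
SAMPLE = [
    ("Systems in Go, TypeScript, and Python",
     {"entities": [(11, 13, "ProgLang"), (25, 15, "ProgLang"),
                   (31, 37, "ProgLang"), (34, 37, "ProgLang")]}),
]

for problem in check_spans(SAMPLE):
    print(problem)
```

Running the same function over a full training list should return an empty list when every span is well formed.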

In [None]:
len(NICO_TRAIN_DATA)

20

## McCoy Train Data

In [None]:
TRAIN_DATA = [
    ("We believe that the way consumers and businesses interact with their finances will drastically improve in the next few years. Plaid's goal is to enable this shift by building the tools and infrastructure that allow developers to create the next generation of financial services applications. As a software engineer at Plaid, you'll be an integral part of building and scaling the APIs that help us all achieve this vision.",
     {"entities": [(126, 131, "ORG"),(318,323, "ORG"), (380,383, "Skill")]}),
    
    ("We’re architected around decoupled services. You could be working on our internal distributed systems framework that holds them all together, or building out the newest version of our API. Most of our back-end core systems are written in Go, but that's not a prerequisite. We pick the right tool for the job and have systems in Go, TypeScript, and Python.",
     {"entities": [(184, 187, "Skill"), (238, 240, "ProgLang"), (328, 330, "ProgLang"), (332, 342, "ProgLang"), (348, 354, "ProgLang")]}),

    ("What You Bring To The Table 1-3 years professional software development experience with a modern web framework (Rails or Django preferred) Team player with excellent written and spoken English communication skills Bachelor's degree in Computer Science, or ample real-world experience Previous experience with relational database design (PostgreSQL preferred) Previous experience with consuming and/or designing RESTful service APIs Expertise with distributed version control systems (git preferred)",
    {"entities": [(112,117,"ProgLang"), (121,127,"Skill"), (214,231,"Education"), (309, 335, "Skill"), (337, 347, "Skill"), (401, 430, "Skill"), (459, 475, "Skill"), (484, 487, "ProgLang")]}),

    ("Grubhub is an equal opportunity employer. We evaluate qualified applicants without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, veteran status, and other legally protected characteristics.",
     {"entities": [(0,7,"ORG")]}),

    ("At Instabase, we're passionate about building software to advance the state of the art in computing. We've built a fearlessly experimental, customer-obsessed team who are making discoveries to fundamentally change how people build and consume business applications. Today, we're partnering with the world's leading companies to transform how they use data and technology. If these challenges excite you, we'd love to hear from you!",
     {"entities": [(3,12,"ORG")]}),

    ("""What you'll do: Very strong scripting languages skills [Python] Very strong data engineering skills [data parsing, web scraping, data transformation, data integration, etc.] Ability to parse structured data from messy text [html, word document, images (ocr), pdf, etc.] Ability to integrate data from various data sources Knowledge of regex, and data parsing/cleaning libraries (pandas, beautiful-soup, etc.) Knowledge of data analysis tools (for example: pandas), NLP tools (for example: spacy, stanford NLP, etc),""",
     {"entities": [(56, 62, "ProgLang"), (101, 113, "Skill"), (115, 127, "Skill"), (129, 148, "Skill"), (150, 166, "Skill"),
                   (335, 340, "ProgLang"), (379, 385, "ProgLang"), (387, 401, "ProgLang"), (456, 462, "ProgLang"), (489, 494, "ProgLang"), (505,508,"Skill")]}),
              
    (""", Image Processing tools (for example: opencv), and Machine Learning tools (for example: scikit-learn, tensorflow, etc.) Ability to quickly script a working solution for a given problem Creative problem-solving skills A results-focused mindset Requirements: B.S. in Computer Science (or equivalent training) Languages: Python Experience in "wrangling" datasets Web Fundamentals, APIs, REST, and GraphQL Database Systems and SQL""",
    {"entities":[(2,18,"Skill"), (39,45,"ProgLang"), (52,68,"Skill"), (89,101,"ProgLang"), (103,113,"ProgLang"),
                 (258,307,"Education"), (319,325,"ProgLang"), (379,383,"Skill"), (385,389,"Skill"), (395,402,"ProgLang"), (424,427,"ProgLang")]}),

    ("We'd Love To See: A passion for designing and building software to help solve problems Front-end development experience with React (JavaScript) Familiarity/interest in learning our technology stack: Python 3 (flask, celery) JavaScript (React, D3.js) Experience working with graph data Apache, PostgreSQL, SOLR, Docker Linux Does this sound like you?",
     {"entities":[(87,108,"Skill"), (125,130,"ProgLang"), (132,142,"ProgLang"), (199,205,"ProgLang"), (209,214,"ProgLang"),
                  (216,222,"ProgLang"), (224,234,"ProgLang"), (236,241,"ProgLang"), (243,248,"ProgLang"), (285,291,"ProgLang"),
                  (293,303,"ProgLang"), (305,309,"ProgLang"), (311,317,"ProgLang"), (318,323,"ProgLang")]}),
              
    ("Drive and code solutions with a creative approach. Create simple solutions for complex business processes, integrating best of breed tools. Work with leadership and others to find the right solutions. For immediate consideration, please email a copy of your resume to Requirements - Experience caring for the technical needs of sites, apps and integrations with high volume traffic - Able to move from quickly from draft mockups into various web languages - Deep understanding of C - Angular JS experience desired - ASP.NET experience - Understanding of HTML and CSS - A real enthusiasm for developing end-to-end solutions that power a positive user experience - Mastery of graphic illustration and user interface design, usability principles and mobile strategies - Strong knowledge of compatibility and cross-browser issues - Hands-on experience with server-side languages - Experience with both front-end and back-end technologies in a B2B or B2C web production environment Robert Half Technology matches IT professionals with remote or on-site jobs on a temporary, project or full-time basis.",
     {"entities":[(421,428,"Skill"), (480,481,"ProgLang"), (484,494,"ProgLang"), (516,523,"ProgLang"),
                  (554,558,"ProgLang"), (563,566,"ProgLang"), (699,720,"Skill"), (722,742,"Skill")]}),

    ("Qualifications Bachelor's Degree in computer science or information technology; Master's Degree a plus. Minimum of two years of experience developing and deploying financial and asset management models in MS Excel.",
     {"entities":[(15,32,"Education"),(80,95,"Education")]}),
    
    ("Experience with modern web technologies and techniques such as HTML5, CSS3, CSS frameworks and CSS pre-processors. Knowledge of responsive design and general web standards Knowledge of HTML email creation using tables Debugging and diagnose problems using DevTools Experience with design tools like Figma or Adobe Photoshop Ability to work under tight deadline",
     {"entities":[(63,68,"ProgLang"),(70,74,"ProgLang"), (76,79,"ProgLang"), (95,98,"ProgLang"), (185,189,"ProgLang"), (299,304,"ProgLang"), (314,324,"ProgLang")]}),

    ("An understanding of web services Experience with multiple programming languages (such as, Java, C++, Ruby, Python, Perl, etc.) Excellent written and verbal communication skills",
     {"entities":[(90,94,"ProgLang"), (96,99,"ProgLang"), (101,105,"ProgLang"), (107,113,"ProgLang"), (115,119,"ProgLang"),]}),

    ("Responsibilities: - Develop web applications to quickly and clearly present complex genomic data to genomic researchers - Design, develop, and test API to enable third party/internal database integrations - Implement new, attractive layouts and themes - Improve the UI/UX - Communicate/collaborate with our scientists - Troubleshoot web applications - Maintain rigorous programming practices in a fast-moving environment Requirements: - BS in computer science or related field - 2+ years of experience in web-based user-interface design and development - JavaScript frameworks (e.g. React, Vue.js, Angular, Bootstrap etc.) - UI design using HTML, CSS etc. - API design and development - SQL, database - Familiarity with AWS/ GitLab / WordPress - Git version control - Successful candidate is expected to start as soon as possible Preferred but not required skills: - Full stack web developer who can also contribute to our backend is a significant plus (Front end is a must) - Background in bioinformatics and knowledge of bioinformatics tools/database (GATK, NCBI, UCSC, Ensembl) - Graphical design/data visualization - Listed examples of open-source work (e.g. GitHub)",
     {"entities":[(148,151,"Skill"), (266,268,"Skill"), (269,271,"Skill"), (583,588,"ProgLang"), (590,596,"ProgLang"),
                   (598,605,"ProgLang"), (607,616,"ProgLang"), (625,634,"Skill"), (641,645,"ProgLang"), (647,650,"ProgLang"),
                   (658,684,"Skill"), (687,690,"ProgLang"), (720,723,"ProgLang"), (725,731,"ProgLang"),
                   (734,743,"ProgLang"), (746,749,"ProgLang"), (1083,1099,"Skill"), (1100,1118,"Skill")]}),
              
    ("Bachelor’s Degree in computer science, information science, or a related field Four years' experience performing progressively more complex and responsible tasks involving development and support of enterprise applications and services Proficient C#, .NET and .Net framework with recent experience developing Web APIs Experience in cloud application design such as Amazon Web Services (AWS) Experience with serverside and clientside (such as: Javascript and Json, NodeJS, React, Lambda, Angular, Amazon SQS) Experience integrating Web applications and services with middleware services Experience within a software development lifecycle, including DevOps practices Experience with code versioning, branching, and release methodologies Experience with SQL, query design and data normalization Strong acumen for software testing; commitment to quality Understanding of secure design and coding practices. Excellent problem-solving skills; able to design creative and pragmatic solutions Demonstrated ability to quickly learn and apply new technologies and skills Demonstrated written/oral communication skills, user liaison skills and personal interaction abilities An effective team player who enjoys collaboration",
     {"entities":[(0,17,"Education"), (247,249,"ProgLang"), (251,255,"ProgLang"), (260,264,"ProgLang"),(313,317,"Skill"), 
                 (386,389,"ProgLang"), (443,453,"ProgLang"), (458,462,"ProgLang"), (464,470,"ProgLang"), (472,477,"ProgLang"), 
                 (479,485,"ProgLang"), (487,494,"ProgLang"), (496,506,"ProgLang")]}),
    
    ("Document code solutions and ensure supportability Top Skills Needed: BS/MS degree in Computer Science, Engineering or a related subject or equivalent experience At least 0-3 years’ experience as a full-stack developer Knowledge of all phases of software development including design, coding, testing, debugging, implementation, and support",
     {"entities":[(69,81,"Education")]}),

    ("biographical data and prospect information. Perform Other Responsibilities As Assigned. Required Qualifications* A bachelor’s degree in business information systems or computer science (or equivalent professional experience). 4+ years of experience as a data analyst, business systems analyst or similar experience.",
     {"entities":[(115,132,"Education")]}),

    ("Participates within a team of scientists to foster a culture of scientific excellence. Requirements Ph.D. or Masters degree in Computational Biology, Statistical Genetics, Computer Science, Engineering, Physics, Math, or other relevant scientific discipline or equivalent experience required Must have demonstrated experience",
     {"entities":[(100,105,"Education"), (109,123,"Education")]}),

    ("Summarize and present conclusions and solutions Communicate complex analyses clearly to all audiences as requested Requirements Bachelor’s degree in a quantitative field such as computer science, statistics, applied mathematics, operations research, ",
     {"entities":[(128,145,"Education")]}),

    ("Basic Qualifications Ph.D., M.S. or Bachelors degree in Statistics, Economics, Machine Learning, Operations Research, or other quantitative fields.",
     {"entities":[(21,26,"Education"),(28,32,"Education"),(36,52,"Education"),(56,66,"Skill"),(79,95,"Skill"),(108,116,"Skill")]}),

    ("(If M.S. degree, a minimum of 1+ years of industry experience required and if Bachelor's degree, a minimum of 2+ years of industry experience as a Data Scientist or equivalent). Knowledge of underlying mathematical foundations of statistics, machine learning, optimization, economics, and analytics. Knowledge of experimental design and analysis Experience with exploratory data analysis, statistical analysis and testing, and model development Ability to use a language like Python or R to work efficiently at scale with large data sets. Proficiency in languages like SQL, R, and Spark.",
     {"entities":[(4,15,"Education"),(78,95,"Education"),(476,482,"ProgLang"),(486,487,"ProgLang"),(569,572,"ProgLang"),(574,575,"ProgLang"),(581,586,"ProgLang")]}),

    ("Preferred Qualifications Ph.D. in Statistics, Computer Science, Economics, Operations Research, Physics, or other quantitative fields Hands-on experience with experimental design, statistical methods, and causal inference Experience working in a technical role in a marketplace environment or experience with NLP Experience leading a data science team ",
     {"entities":[(25,30,"Education"), (309,312,"Skill")]}),

    ("""Applicants must be willing to relocate to one of the major geographic areas where we have significant customer accounts and/or travel may be required Cognizant will not sponsor H-1B or other U.S. work authorization, or lawful permanent residence (otherwise known as a “Green Card”) for these roles""",
    {"entities":[(165,181,"NonSponsor")]}),

    ("Bounteous is willing to sponsor eligible candidates for employment visas.",
    {"entities":[(0,9,"ORG"),(10,72,"Sponsor")]}),

    ("Pythian will not relocate, sponsor, or file petitions of any kind on behalf of a foreign worker to gain a work visa, become a permanent resident based on a permanent job offer, or to otherwise obtain authorization to work.",
    {"entities":[(0,7,"ORG"),(8,34,"NonSponsor")]}),

    ("Long Beach Transit does not sponsor H-1B or other related work visas.",
    {"entities":[(0,18,"ORG"),(19,68,"NonSponsor")]}),

    ("YieldX currently does not sponsor H1B work visas.",
    {"entities":[(0,6,"ORG"),(17,48,"NonSponsor")]}),

    ("Unfortunately, our client is unable to sponsor for this position.",
    {"entities":[(29,46,"NonSponsor")]}),

    ("New Balance will not sponsor applicants for work visas",
    {"entities":[(0,11,"ORG"),(12,54,"NonSponsor")]}),

    ("Position is open to H1b visa candidates (group is H1b cap exempt) and positions are 2021 J1 visa eligible. Employer will sponsor green card process if needed",
     {"entities":[(9,39,"Sponsor"),(92,105,"Sponsor"),(116,139,"Sponsor")]}),

    ("NO RECRUITERS, NO H1B, NO F1, Sorry, we will not transfer or sponsor visas.",
     {"entities":[(15,21,"NonSponsor"),(23,28,"NonSponsor"),(40,74,"NonSponsor")]}),

    ("Only US Citizens or Green Card Holder Apply.   NO H1B and EAD Candidates will be accepted.",
     {"entities":[(47,61,"NonSponsor")]}),

    ("This is a Full Time Python Developer Role in NYC that is paying anywhere from $140,000- $160,000 yr. There is no sponsorship for this role, sorry no H1b candidates or 3rd parties",
     {"entities":[(110,138,"NonSponsor"), (146,163,"NonSponsor")]}),

    ("Anatomage is an Equal Employment Opportunity employer. We do not offer H1B Sponsorship at this time.",
     {"entities":[(58,86,"NonSponsor")]}),

    ("Must be legally authorized to work in country of employment without sponsorship for employment visa status (e.g., H1B status).",
     {"entities":[(60,94,"NonSponsor")]}),

    ("At this time we are unable to sponsor H1B candidates",
     {"entities":[(20,41,"NonSponsor")]}),

    ("This position does not qualify for VISA sponsorship.",
     {"entities":[(14,51,"NonSponsor")]}),

    ("are willing to sponsor for",
     {"entities":[(4,22,"Sponsor")]}),

    ("are willing to sponsor h1b for",
     {"entities":[(4,26,"Sponsor")]}),

    ("can sponsor for the right candidate",
     {"entities":[(0,11,"Sponsor")]}),

    ("can sponsor h1b for the right candidate",
     {"entities":[(0,15,"Sponsor")]}),

    ("can sponsor H-1B for the right candidate",
     {"entities":[(0,16,"Sponsor")]}),

    ("Applicants for employment in the US must have work authorization that does not now or in the future require sponsorship of a visa for employment authorization in the United States and",
     {"entities":[(36,119,"NonSponsor")]}),

    ("using data visualization tool such as Tableau, Spotfire, or Qlikview",
     {"entities":[(38,45,"ProgLang"),(47,55,"ProgLang"),(60,68,"ProgLang")]}),
]

In [None]:
len(TRAIN_DATA)

43
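
Character offsets typed by hand are easy to get wrong, and spaCy only warns about misaligned spans during training. Below is a minimal sketch of a sanity check that could be run over the training data first. The helper name `find_bad_spans` and the tiny `sample` list are made up for illustration; only the `(text, {"entities": [...]})` shape comes from the cells above.

```python
def find_bad_spans(train_data):
    """Return (example_index, start, end, label, reason) for suspicious spans."""
    problems = []
    for i, (text, annotations) in enumerate(train_data):
        for start, end, label in annotations.get("entities", []):
            if not (0 <= start < end <= len(text)):
                # e.g. an end offset smaller than the start offset
                problems.append((i, start, end, label, "out of range or end <= start"))
            elif text[start:end] != text[start:end].strip():
                # span swallows a space on either side of the entity
                problems.append((i, start, end, label, "leading/trailing whitespace"))
    return problems


# Hypothetical stand-in data, shaped like TRAIN_DATA
sample = [
    ("systems in Go, TypeScript, and Python.",
     {"entities": [(11, 13, "ProgLang"), (15, 25, "ProgLang")]}),
    ("written in Go", {"entities": [(11, 3, "ProgLang")]}),
    ("Python or R to work", {"entities": [(9, 12, "ProgLang")]}),
]

for problem in find_bad_spans(sample):
    print(problem)
```

In the notebook this would be called as `find_bad_spans(TRAIN_DATA)`. It does not check token boundaries. For that, spaCy v2 has `spacy.gold.biluo_tags_from_offsets`, which marks tokens it cannot align with `-`.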

## Merging both train data sets

In [None]:
# Concatenate both annotated sets. Re-running this cell appends NICO_TRAIN_DATA
# again, so run it only once per session.

TRAIN_DATA = NICO_TRAIN_DATA + TRAIN_DATA

In [None]:
len(TRAIN_DATA)

63
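
With only 63 examples, it helps to know how many annotated spans each label has before training, since a label with a handful of spans is unlikely to be learned well. A small sketch, assuming the same `(text, {"entities": [...]})` shape; `label_counts` and `sample` are hypothetical names used only here.

```python
from collections import Counter


def label_counts(train_data):
    """Count annotated spans per entity label."""
    counts = Counter()
    for _, annotations in train_data:
        for _, _, label in annotations.get("entities", []):
            counts[label] += 1
    return counts


# Hypothetical stand-in data, shaped like TRAIN_DATA
sample = [
    ("Grubhub is hiring Python developers",
     {"entities": [(0, 7, "ORG"), (18, 24, "ProgLang")]}),
    ("We use Go", {"entities": [(7, 9, "ProgLang")]}),
]

counts = label_counts(sample)
for label, n in counts.most_common():
    print(label, n)
```

In the notebook, `label_counts(TRAIN_DATA).most_common()` would list the labels from most to least frequent.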

In [None]:
def NERS(model=None, output_dir=None, n_iter=100, v=False):
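    """Load or create a model, add an NER pipe, and train it on the module-level TRAIN_DATA.

    model: name or path of an existing spaCy model, or None to start from a blank English model.
    output_dir: directory to save the trained model to (skipped if None).
    n_iter: number of passes over the training data.
    v: if truthy, print the losses after each pass.
    """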
    from pathlib import Path

    #### https://spacy.io/usage/training
    if model is not None:
        nlp = spacy.load(model)  # load existing spaCy model
        print("Loaded model '%s'" % model)
    else:
        nlp = spacy.blank("en")  # create blank Language class
        print("Created blank 'en' model")

    # create the built-in pipeline components and add them to the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if "ner" not in nlp.pipe_names:
        ner = nlp.create_pipe("ner")
        nlp.add_pipe(ner, last=True)
    # otherwise, get it so we can add labels
    else:
        ner = nlp.get_pipe("ner")

    # add labels
    for _, annotations in TRAIN_DATA:
        for ent in annotations.get("entities"):
            ner.add_label(ent[2])

    # get names of other pipes to disable them during training
    pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
    # only train NER
    with nlp.disable_pipes(*other_pipes), warnings.catch_warnings():
        # show warnings for misaligned entity spans once
        warnings.filterwarnings("once", category=UserWarning, module='spacy')

        # reset and initialize the weights randomly – but only if we're
        # training a new model
        if model is None:
            nlp.begin_training()
        for itn in range(n_iter):
            random.shuffle(TRAIN_DATA)
            losses = {}
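            # batch up the examples using spaCy's minibatch; the size schedule
            # restarts at 4 every epoch and grows by only 0.1% per batch,
            # so batches stay at 4 examples here
            batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))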
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(
                    texts,  # batch of texts
                    annotations,  # batch of annotations
                    drop=0.5,  # dropout - make it harder to memorise data
                    losses=losses,
                )
            if v:
                print("Losses", losses)

    # test the trained model
    for text, _ in TRAIN_DATA:
        doc = nlp(text)
        print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
        #print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc])

    # save model to output directory
    if output_dir is not None:
        output_dir = Path(output_dir)
        output_dir.mkdir(parents=True, exist_ok=True)  # also creates missing parent folders
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)

        # test the saved model
        print("Loading from", output_dir)
        nlp2 = spacy.load(output_dir)
        for text, _ in TRAIN_DATA:
            doc = nlp2(text)
            print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
            #print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc])
    
    return nlp

custom_NLP = NERS(output_dir="/content/drive/Shareddrives/SI650 Project [Info Retrieval]/entity-model", n_iter=400, v=True)

Created blank 'en' model
Losses {'ner': 2850.321544557053}
Losses {'ner': 1004.8218245740281}
Losses {'ner': 926.2826374433935}
Losses {'ner': 886.3324064547196}
Losses {'ner': 849.7777309243102}
Losses {'ner': 780.1328689345974}
Losses {'ner': 787.9446716988459}
Losses {'ner': 732.5165140746394}
Losses {'ner': 683.18720869557}
Losses {'ner': 637.8871305722278}
Losses {'ner': 627.1880224705674}
Losses {'ner': 612.7527940311702}
Losses {'ner': 639.6467996179126}
Losses {'ner': 567.2118845491786}
Losses {'ner': 656.5105703847948}
Losses {'ner': 681.2907164139324}
Losses {'ner': 646.3340620186718}
Losses {'ner': 660.0491835115708}
Losses {'ner': 645.4046791047513}
Losses {'ner': 560.2575921235548}
Losses {'ner': 529.9410412268771}
Losses {'ner': 544.1543288267578}
Losses {'ner': 528.5738849963236}
Losses {'ner': 554.6020098459051}
Losses {'ner': 512.2723697244219}
Losses {'ner': 549.1842005116978}
Losses {'ner': 505.0735536510276}
Losses {'ner': 499.38802379931167}
Losses {'ner': 493.1792
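
The training function prints predictions for the same texts it was trained on, so a falling loss and good-looking printouts say little about how the model handles unseen postings. Below is a minimal sketch of holding out part of the data and scoring exact span matches. `split_train_dev`, `span_prf`, the 20% fraction and the stand-in data are all assumptions for illustration, not part of the notebook.

```python
import random


def split_train_dev(data, dev_fraction=0.2, seed=0):
    """Shuffle a copy of `data` and split off a held-out dev set."""
    items = list(data)
    random.Random(seed).shuffle(items)
    n_dev = max(1, int(len(items) * dev_fraction))
    return items[n_dev:], items[:n_dev]


def span_prf(gold_spans, predicted_spans):
    """Exact-match precision, recall and F1 over (start, end, label) spans."""
    gold, pred = set(gold_spans), set(predicted_spans)
    true_pos = len(gold & pred)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical stand-in for the 63 merged examples
examples = [("text %d" % i, {"entities": []}) for i in range(63)]
train, dev = split_train_dev(examples)
print(len(train), len(dev))

# Hypothetical gold and predicted spans for one text
gold = [(0, 7, "ORG"), (20, 26, "ProgLang")]
pred = [(0, 7, "ORG"), (30, 33, "Skill")]
print(span_prf(gold, pred))
```

In the notebook, predicted spans for a text could be collected as `[(e.start_char, e.end_char, e.label_) for e in custom_NLP(text).ents]`. Because `NERS` reads the module-level `TRAIN_DATA`, the training portion would have to be assigned to that name before calling it.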

In [None]:
from spacy import displacy

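# displaCy upper-cases entity labels before looking them up, so the keys are upper-case
# even though the training labels are "Skill", "ProgLang" and "Education".
# "Sponsor" and "NonSponsor" are not listed in "ents", so they are not highlighted.
colors = {"SKILL": "orange", "PROGLANG": "violet", "ORG": "cyan", "EDUCATION":"green"}
options = {"ents": ["SKILL", "PROGLANG", "ORG", "EDUCATION"], "colors": colors}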

displacy.render(custom_NLP(TRAIN_DATA[4][0]), jupyter=True, style='ent', options=options)

***
<b>Grab the data for eventual testing</b>

In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

import pandas as pd
Full_DF = pd.read_csv("/content/drive/Shareddrives/SI650 Project [Info Retrieval]/data/FULL_DF.csv")

Mounted at /content/drive


In [None]:
displacy.render(custom_NLP(Full_DF.description[0].replace('\n\n',' ')), jupyter=True, style='ent', options=options)

In [None]:
custom_NLP(Full_DF.description[0].replace('\n\n',' ')).ents

(particular promises,
 Q,
 Sequoia,
 InQTel,
 Machine Learning,
 will perform research,
 machine learning,
 machine learning,
 deployment,
 communication will,
 Python,
 C++,
 essential,
 caffe2,
 CUDA,
 scikit-learn,
 Tensorflow,
 model compression,
 compilation,
 deep learning,
 deep learning,
 IoT,
 below,
 essential :,
 ML,
 machine learning,
 Q,
 America,
 recruitment,
 do not accept unsolicited)
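
The tuple above shows only the entity text, which makes it hard to tell which label produced noisy hits such as "below" or "will perform research". A small sketch that groups predictions by label. `group_by_label` is a hypothetical helper, and the labels in `pairs` are invented for the example rather than taken from the model's output.

```python
from collections import defaultdict


def group_by_label(entities):
    """Group (text, label) pairs into {label: sorted unique texts}."""
    grouped = defaultdict(set)
    for text, label in entities:
        grouped[label].add(text)
    return {label: sorted(texts) for label, texts in grouped.items()}


# Hypothetical (text, label) pairs; in the notebook they would come from
# [(e.text, e.label_) for e in doc.ents]
pairs = [
    ("Python", "ProgLang"),
    ("C++", "ProgLang"),
    ("machine learning", "Skill"),
    ("machine learning", "Skill"),
]

by_label = group_by_label(pairs)
for label, texts in by_label.items():
    print(label, texts)
```

Duplicates collapse into one entry per label, so repeated hits like "machine learning" appear once.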