# Method 1, Incremental Training
## Using the spaCy CLI for training and evaluation
Training proceeds in two stages.
Stage 1 trains the base model `en_core_web_md` on ECHR data, then evaluates it on the ECHR test data. After this stage, catastrophic forgetting can be observed: entities that the base model recognized are missing from the retrained model's predictions.
Stage 2 resumes training from the Stage 1 model and uses an annotated dataset which was run against a spaCy model. It contains entities from the original model plus a few new entities that were added by Presidio (ADDRESS, PHONE, URL, etc.).

In [None]:
#!python -m spacy download en_core_web_md

In [99]:
!python -m spacy benchmark accuracy --gpu-id=0 "en_core_web_md" "../../data/annotated/dev.spacy"

ℹ Using GPU: 0

TOK      100.00
TAG      -     
POS      -     
MORPH    -     
LEMMA    -     
UAS      -     
LAS      -     
NER P    31.29 
NER R    49.38 
NER F    38.31 
SENT P   -     
SENT R   -     
SENT F   -     
SPEED    8078

                  P       R       F
CARDINAL       0.00    0.00    0.00
GPE           44.95   67.34   53.91
ORG            9.57   13.56   11.22
LAW            0.00    0.00    0.00
NORP           0.00    0.00    0.00
PERSON        18.41   20.68   19.48
DATE          78.61   91.17   84.42
ORDINAL        0.00    0.00    0.00
DEM            0.00    0.00    0.00
FAC            0.00    0.00    0.00
TIME           0.00    0.00    0.00
QUANTITY       0.00    0.00    0.00
MONEY          0.00    0.00    0.00
PERCENT        0.00    0.00    0.00
LANGUAGE       0.00    0.00    0.00
PRODUCT        0.00    0.00    0.00
LOC            0.00    0.00    0.00
WORK_OF_ART    0.00    0.00    0.00
EVENT       
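
As a quick cross-check of the summary table above, the reported NER F should be the harmonic mean of NER P and NER R. A minimal sketch, assuming that standard definition (the helper name `f_score` is ours and is not part of spaCy):

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# NER P and NER R copied from the baseline benchmark output above.
baseline_f = f_score(31.29, 49.38)
print(round(baseline_f, 2))  # 38.31, matching the reported NER F
```

Because the table values are already rounded, recomputing F for other rows may differ from the reported figure in the last digit.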

In [77]:
# baseline test against pre-trained model to see which entities it picks up from resume text.

import spacy
import re
activated = spacy.prefer_gpu()
nlp = spacy.load('en_core_web_md')

text = ['''SIDDHARTH RAGHUVANSHI                                Roll No. 06CS3025                                            DOB: 08/08/1988
Email: siddharth.iitkharagpur@gmail.com                                                                                          Mobile No.:   +91 9932584135
Degree/Certificate
Dual Degree[B. Tech (H) + M. Tech]
(Computer Science & Engineering)
Class XII:  C.B.S.E.
Class X:   C.B.S.E.
ACADEMIC ACHIEVEMENTS
Institute/ School, City
Indian Institute of Technology, Kharagpur
Central Hindu School, Varanasi
St. Atulanand Convent School, Varanasi
CGPA/ %  Completion
8.26/10
86.0%
90.8%
2011
2005
2003
Competitive
Examinations
  All India Rank 116 in AIEEE, 2006 among 470,000 students, State Rank 8 in Uttar Pradesh.
  All India Rank 119 in 7th National Science Olympiad, 2005.
  All India Rank 22 in All India Level Mathematics & Science Test organized by Central Institute  for
Proficiency in English Language (CIPEL).
Scholastic
Achievements
  National top 1% out of 26968 candidates appeared in National Standard Examination in Physics’05
  Receiving CBSE Merit Scholarship for the past 4 years.
ACADEMIC PROJECTS
M. Tech Project                                                                                      IIT Kharagpur                                           May’10-Nov’10
•
•
Studied the performance of text indexing algorithms on Hadoop MapReduce architecture.
Future work includes implementing more efficient indexing and retrieval techniques in MapReduce for distributed parallel
computing.
B. Tech Project                                                                                        IIT Kharagpur                                           Aug’09-May’10
  Developed a software with can handle all sorts of query related to geographical information extracted from maps.
  Developed a client interface which can fetch data from different incompatible geospatial web services and make that data
compatible for resolving queries.
Integrated my framework engine with different underlying heterogeneous spatial databases.

Static Instrumentation Of Java Programs                                          IIT Kharagpur                                                   May’08
  Developed a program using Byte Code Engineering Library to do automated testing of java program at byte code level.
WORK EXPERIENCE / INTERNSHIP
Extreme Blue Internship Program                                                                   ISL, IBM, Pune, India                                       May’09 – July’09
Business
Perspective
Technical
Perspective
 Achievements
  Conducted survey in Pune region on the current home delivery status of organized retails

Proposed and implemented a solution on how to increase home delivery sales in order to compete with the
localized general (kirana) stores

Built an independent Home Delivery module on Java EE platform using open standards such as XML and
Web Services
Integrated the Home Delivery module with IBM WebSphere Commerce.

  Received highest grade 10/10 in summer internship evaluation at IIT Kharagpur, 2009.
RELEVANT COURSES TAKEN
  Machine learning
  Algorithms-I
  Algorithms-II
Information Retrieval

  Distributed Systems

Probability and Statistics
POSITION OF RESPONSIBILITY

Student coordinator of IIT Kharagpur Student Counselling Service.
  Student member of team that conceptualized and publicized Counselling Centre in IIT Kharagpur after 5 successive suicides
in the campus within a span of 6 months in between Feb’09 and Jul’09.
  More than 100 students are counselled every month.
  No mishaps in the campus as of Sep’10 after the establishment of the centre.
  Went through Gate Keepers Training to identify behavioral change in a person.
  Managed  the  systems  team  of  Bitwise-2010,  an  international  algorithmic  intensive  programming  contest  leading  to  the
participation of 3000 teams across 75 countries.

Family Sub-head of accommodation team in Spring Fest, 2008.
  Head boy of my Senior Secondary School (Central Hindu School).
e
EXTRA CURRICULAR ACHIEVEMENTS
  Member of Silver winning team in inter hall OPENSOFT Competition in the session 2007-08.
  National Sports Organization: Among Top 30 students in Lawn Tennis Team at IIT Kharagpur’06. ''']

# normalize whitespace as per https://github.com/explosion/spaCy/discussions/10243
r = []
for t in text:
    r.append(re.sub(r"\s+", " ", t))

for doc in nlp.pipe(r):
    print([(ent.text, ent.label_) for ent in doc.ents])

[('SIDDHARTH RAGHUVANSHI Roll', 'PERSON'), ('Computer Science & Engineering', 'ORG'), ('City Indian Institute of Technology', 'ORG'), ('Kharagpur Central Hindu School', 'ORG'), ('Varanasi', 'GPE'), ('Varanasi', 'GPE'), ('86.0%', 'PERCENT'), ('90.8%', 'PERCENT'), ('2011 2005', 'DATE'), ('2003', 'DATE'), ('116', 'CARDINAL'), ('AIEEE', 'GPE'), ('2006', 'DATE'), ('470,000', 'CARDINAL'), ('State', 'ORG'), ('8', 'CARDINAL'), ('Uttar Pradesh', 'GPE'), ('119', 'CARDINAL'), ('7th', 'ORDINAL'), ('National Science Olympiad, 2005', 'EVENT'), ('22', 'CARDINAL'), ('Central Institute for Proficiency', 'ORG'), ('English', 'LANGUAGE'), ('CIPEL', 'PERSON'), ('1%', 'PERCENT'), ('26968', 'CARDINAL'), ('National Standard Examination', 'ORG'), ('the past 4 years', 'DATE'), ('ACADEMIC PROJECTS M. Tech Project', 'ORG'), ('IIT', 'ORG'), ('Kharagpur', 'GPE'), ('Hadoop', 'ORG'), ('B. Tech Project', 'ORG'), ('IIT', 'ORG'), ('IIT Kharagpur May’08 \uf0a7', 'ORG'), ('Byte Code Engineering Library', 'ORG'), ('INTERNS
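
The whitespace normalization used in the cell above collapses every run of whitespace, including newlines, into a single space. A minimal sketch of its effect on two lines taken from the resume (the function name `normalize_whitespace` is ours, introduced only for illustration):

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse each run of spaces, tabs, and newlines into one space."""
    return re.sub(r"\s+", " ", text)


# Two adjacent lines copied from the resume text.
sample = "Indian Institute of Technology, Kharagpur\nCentral Hindu School, Varanasi"
print(normalize_whitespace(sample))
# Indian Institute of Technology, Kharagpur Central Hindu School, Varanasi
```

Since line boundaries are lost, the model sees `Kharagpur Central Hindu School` as contiguous text, which is likely one reason that span appears as a single ORG entity in the output.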

In [78]:
!python -m spacy init fill-config config/base_config_md.cfg config/config_md.cfg

✔ Auto-filled config with all values
✔ Saved config
config/config_md.cfg
You can now add your data and train your pipeline:
python -m spacy train config_md.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


In [79]:
!python -m spacy debug data config/config_md.cfg

✔ Pipeline can be initialized with data
✔ Corpus is loadable

Language: en
Training pipeline: tok2vec, ner
Components from other pipelines: ner, tok2vec
1141 training docs
127 evaluation docs
✔ No overlap between training and evaluation data

ℹ 1704123 total word(s) in the data (32105 unique)
ℹ 20000 vectors (514157 unique keys, 300 dimensions)
⚠ 44850 words in training data without vectors (3%)

ℹ 19 label(s)
0 missing value(s) (tokens with '-' label)
⚠ Some model labels are not present in the train data. The model
performance may be degraded for these labels after training: 'NORP',
'WORK_OF_ART', 'EVENT', 'FAC', 'MONEY', 'LOC', 'ORDINAL', 'LAW', 'CARDINAL',
'PRODUCT', 'LANGUAGE', 'QUANTITY', 'TIME', 'PERCENT'.
✔ Good amount of examples for all labels
✔ Examples without occurrences available for all labels
✔

In [80]:
!python -m spacy train --gpu-id=0 config/config_md.cfg -o ../../data/models/spacy/md

ℹ Saving to output directory: ../../data/models/spacy/md
ℹ Using GPU: 0

[2023-07-26 14:20:37,022] [INFO] Set up nlp object from config
[2023-07-26 14:20:37,032] [INFO] Pipeline: ['tok2vec', 'ner']
[2023-07-26 14:20:37,033] [INFO] Resuming training for: ['ner', 'tok2vec']
[2023-07-26 14:20:37,039] [INFO] Created vocabulary
[2023-07-26 14:20:38,612] [INFO] Added vectors: en_core_web_md
[2023-07-26 14:20:38,690] [INFO] Finished initializing nlp object
[2023-07-26 14:20:38,690] [INFO] Initialized pipeline components: []
✔ Initialized pipeline

ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00    441.75   38.98   31.98   49.90    0.39
  0     200          0.00   7708.04   81.66   81.49   81.83    0.82
  0     400          0.00   608

In [81]:
!python -m spacy benchmark accuracy --gpu-id=0 "../../data/models/spacy/md/model-best" "../../data/annotated/dev.spacy"
# !python -m spacy benchmark accuracy --help

ℹ Using GPU: 0

TOK     100.00
NER P   89.29 
NER R   84.18 
NER F   86.66 
SPEED   11354

             P       R       F
ORG      85.48   73.40   78.98
DEM      77.63   42.34   54.80
PERSON   90.25   95.24   92.68
DATE     92.53   94.23   93.37
GPE      87.95   85.22   86.56



In [82]:
activated = spacy.prefer_gpu()
nlp = spacy.load('../../data/models/spacy/md/model-best')

text = ['''SIDDHARTH RAGHUVANSHI                                Roll No. 06CS3025                                            DOB: 08/08/1988
Email: siddharth.iitkharagpur@gmail.com                                                                                          Mobile No.:   +91 9932584135
Degree/Certificate
Dual Degree[B. Tech (H) + M. Tech]
(Computer Science & Engineering)
Class XII:  C.B.S.E.
Class X:   C.B.S.E.
ACADEMIC ACHIEVEMENTS
Institute/ School, City
Indian Institute of Technology, Kharagpur
Central Hindu School, Varanasi
St. Atulanand Convent School, Varanasi
CGPA/ %  Completion
8.26/10
86.0%
90.8%
2011
2005
2003
Competitive
Examinations
  All India Rank 116 in AIEEE, 2006 among 470,000 students, State Rank 8 in Uttar Pradesh.
  All India Rank 119 in 7th National Science Olympiad, 2005.
  All India Rank 22 in All India Level Mathematics & Science Test organized by Central Institute  for
Proficiency in English Language (CIPEL).
Scholastic
Achievements
  National top 1% out of 26968 candidates appeared in National Standard Examination in Physics’05
  Receiving CBSE Merit Scholarship for the past 4 years.
ACADEMIC PROJECTS
M. Tech Project                                                                                      IIT Kharagpur                                           May’10-Nov’10
•
•
Studied the performance of text indexing algorithms on Hadoop MapReduce architecture.
Future work includes implementing more efficient indexing and retrieval techniques in MapReduce for distributed parallel
computing.
B. Tech Project                                                                                        IIT Kharagpur                                           Aug’09-May’10
  Developed a software with can handle all sorts of query related to geographical information extracted from maps.
  Developed a client interface which can fetch data from different incompatible geospatial web services and make that data
compatible for resolving queries.
Integrated my framework engine with different underlying heterogeneous spatial databases.

Static Instrumentation Of Java Programs                                          IIT Kharagpur                                                   May’08
  Developed a program using Byte Code Engineering Library to do automated testing of java program at byte code level.
WORK EXPERIENCE / INTERNSHIP
Extreme Blue Internship Program                                                                   ISL, IBM, Pune, India                                       May’09 – July’09
Business
Perspective
Technical
Perspective
 Achievements
  Conducted survey in Pune region on the current home delivery status of organized retails

Proposed and implemented a solution on how to increase home delivery sales in order to compete with the
localized general (kirana) stores

Built an independent Home Delivery module on Java EE platform using open standards such as XML and
Web Services
Integrated the Home Delivery module with IBM WebSphere Commerce.

  Received highest grade 10/10 in summer internship evaluation at IIT Kharagpur, 2009.
RELEVANT COURSES TAKEN
  Machine learning
  Algorithms-I
  Algorithms-II
Information Retrieval

  Distributed Systems

Probability and Statistics
POSITION OF RESPONSIBILITY

Student coordinator of IIT Kharagpur Student Counselling Service.
  Student member of team that conceptualized and publicized Counselling Centre in IIT Kharagpur after 5 successive suicides
in the campus within a span of 6 months in between Feb’09 and Jul’09.
  More than 100 students are counselled every month.
  No mishaps in the campus as of Sep’10 after the establishment of the centre.
  Went through Gate Keepers Training to identify behavioral change in a person.
  Managed  the  systems  team  of  Bitwise-2010,  an  international  algorithmic  intensive  programming  contest  leading  to  the
participation of 3000 teams across 75 countries.

Family Sub-head of accommodation team in Spring Fest, 2008.
  Head boy of my Senior Secondary School (Central Hindu School).
e
EXTRA CURRICULAR ACHIEVEMENTS
  Member of Silver winning team in inter hall OPENSOFT Competition in the session 2007-08.
  National Sports Organization: Among Top 30 students in Lawn Tennis Team at IIT Kharagpur’06. ''']

# normalize whitespace as per https://github.com/explosion/spaCy/discussions/10243
r = []
for t in text:
    r.append(re.sub(r"\s+", " ", t))

for doc in nlp.pipe(r):
    print([(ent.text, ent.label_) for ent in doc.ents])

[('Kharagpur Central Hindu School', 'ORG'), ('Varanasi St. Atulanand Convent School', 'ORG'), ('Varanasi', 'GPE'), ('2006', 'DATE'), ('Uttar Pradesh', 'GPE'), ('2005', 'DATE'), ('Central Institute for Proficiency in English Language (CIPEL)', 'ORG'), ('National Standard Examination', 'ORG'), ('4 years', 'DATE'), ('IIT', 'GPE'), ('Hadoop', 'GPE'), ('IBM', 'ORG'), ('Pune', 'GPE'), ('Pune', 'GPE'), ('Java EE', 'ORG'), ('XML and Web Services Integrated the Home Delivery', 'ORG'), ('IBM WebSphere Commerce', 'ORG'), ('IIT Kharagpur', 'GPE'), ('2009', 'DATE'), ('IIT Kharagpur Student Counselling Service', 'ORG'), ('Counselling Centre', 'ORG'), ('IIT Kharagpur', 'GPE'), ('6 months', 'DATE'), ('Feb’09', 'PERSON'), ('every month', 'DATE'), ('Bitwise-2010', 'ORG'), ('Spring Fest', 'ORG'), ('2008', 'DATE'), ('2007-08', 'DATE'), ('National Sports Organization', 'ORG'), ('Lawn Tennis Team', 'ORG'), ('IIT Kharagpur’06', 'GPE')]
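
The forgetting effect is visible when the labels predicted before and after Stage 1 are compared. A minimal sketch using a few `(text, label)` pairs copied from the two resume outputs above; these are excerpts rather than the full lists, and the helper name `predicted_labels` is ours:

```python
def predicted_labels(entities):
    """Return the set of labels found in a list of (text, label) pairs."""
    return {label for _, label in entities}


# Excerpt of the baseline en_core_web_md output.
baseline_ents = [
    ("86.0%", "PERCENT"), ("116", "CARDINAL"), ("7th", "ORDINAL"),
    ("English", "LANGUAGE"), ("Varanasi", "GPE"), ("2006", "DATE"),
]
# Excerpt of the Stage 1 model output.
stage1_ents = [
    ("Varanasi", "GPE"), ("2006", "DATE"), ("IBM", "ORG"), ("Pune", "GPE"),
]

# Labels the baseline produced that the Stage 1 model no longer produces.
lost = predicted_labels(baseline_ents) - predicted_labels(stage1_ents)
print(sorted(lost))  # ['CARDINAL', 'LANGUAGE', 'ORDINAL', 'PERCENT']
```

With real pipelines, the same comparison could be run on `doc.ents` from each model instead of hand-copied pairs.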


In [89]:
!python -m spacy init fill-config config/base_config_md_2nd_step.cfg config/config_md_2nd_step.cfg

✔ Auto-filled config with all values
✔ Saved config
config/config_md_2nd_step.cfg
You can now add your data and train your pipeline:
python -m spacy train config_md_2nd_step.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


In [90]:
!python -m spacy debug data config/config_md_2nd_step.cfg

✔ Pipeline can be initialized with data
✔ Corpus is loadable

Language: en
Training pipeline: tok2vec, ner
Components from other pipelines: ner, tok2vec
2374 training docs
626 evaluation docs
✔ No overlap between training and evaluation data

ℹ 42743 total word(s) in the data (6579 unique)
ℹ 20000 vectors (514157 unique keys, 300 dimensions)
⚠ 6136 words in training data without vectors (14%)

ℹ 31 label(s)
0 missing value(s) (tokens with '-' label)
⚠ Some model labels are not present in the train data. The model
performance may be degraded for these labels after training: 'PRODUCT',
'WORK_OF_ART', 'QUANTITY', 'MONEY', 'LANGUAGE', 'PERCENT', 'FAC', 'TIME',
'CARDINAL', 'LOC', 'LAW', 'EVENT', 'NORP', 'ORDINAL', 'DEM'.
⚠ Low number of examples for label 'IBAN_CODE' (50)
⚠ Low number of examples for label 'IP_ADDRESS' (15)

In [91]:
!python -m spacy train --gpu-id=0 config/config_md_2nd_step.cfg -o ../../data/models/spacy/md/2

ℹ Saving to output directory: ../../data/models/spacy/md/2
ℹ Using GPU: 0

[2023-07-26 14:59:15,873] [INFO] Set up nlp object from config
[2023-07-26 14:59:15,883] [INFO] Pipeline: ['tok2vec', 'ner']
[2023-07-26 14:59:15,883] [INFO] Resuming training for: ['ner', 'tok2vec']
[2023-07-26 14:59:15,890] [INFO] Created vocabulary
[2023-07-26 14:59:17,365] [INFO] Added vectors: en_core_web_md
[2023-07-26 14:59:17,438] [INFO] Finished initializing nlp object
[2023-07-26 14:59:17,439] [INFO] Initialized pipeline components: []
✔ Initialized pipeline

ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00     10.95   37.32   51.82   29.16    0.37
  0     200          0.00   2007.84   45.39   48.66   42.53    0.45
  1     400          0.00   1

In [97]:
!python -m spacy benchmark accuracy --gpu-id=0 "../../data/models/spacy/md/2/model-best" "../../data/annotated/dev_step2.spacy"

ℹ Using GPU: 0

TOK     100.00
NER P   61.28 
NER R   65.53 
NER F   63.34 
SPEED   11133

                      P        R       F
TITLE             94.00    79.66   86.24
PERSON            74.74    87.55   80.64
GPE               60.87    35.53   44.87
ORG               10.53    30.14   15.60
AGE               44.44    51.61   47.76
CREDIT_CARD       71.43   100.00   83.33
DATE             100.00    75.71   86.18
NRP               47.17    60.98   53.19
DOMAIN_NAME      100.00    41.18   58.33
PHONE_NUMBER      58.49    64.58   61.39
STREET_ADDRESS    63.37    58.18   60.66
ZIP_CODE           0.00     0.00    0.00
EMAIL_ADDRESS     77.27    94.44   85.00
IP_ADDRESS         0.00     0.00    0.00



In [95]:
!python -m spacy benchmark accuracy --gpu-id=0 "../../data/models/spacy/md/2/model-best" "../../data/annotated/dev.spacy"
# sanity check, of sorts, to test the model against the original ECHR test dataset (trained in stage 1)

ℹ Using GPU: 0

TOK     100.00
NER P   33.37 
NER R   35.54 
NER F   34.42 
SPEED   10823

                     P       R       F
ORG              75.35   55.90   64.18
NRP               0.00    0.00    0.00
TITLE             0.00    0.00    0.00
PERSON           44.01   52.82   48.01
PHONE_NUMBER      0.00    0.00    0.00
GPE              61.80   74.09   67.39
DATE             55.83   11.35   18.87
STREET_ADDRESS    0.00    0.00    0.00
DEM               0.00    0.00    0.00
ZIP_CODE          0.00    0.00    0.00
CREDIT_CARD       0.00    0.00    0.00
AGE               0.00    0.00    0.00
IP_ADDRESS        0.00    0.00    0.00
IBAN_CODE         0.00    0.00    0.00
EMAIL_ADDRESS     0.00    0.00    0.00
US_SSN            0.00    0.00    0.00
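
To quantify how much Stage 2 degraded the labels learned in Stage 1, the per-label F-scores on `dev.spacy` from the two benchmark tables can be compared directly. A minimal sketch with the values copied from those tables (the helper name `f_score_change` is ours):

```python
# F-scores on dev.spacy, copied from the Stage 1 and Stage 2 benchmark tables.
stage1_f = {"ORG": 78.98, "DEM": 54.80, "PERSON": 92.68, "DATE": 93.37, "GPE": 86.56}
stage2_f = {"ORG": 64.18, "DEM": 0.00, "PERSON": 48.01, "DATE": 18.87, "GPE": 67.39}


def f_score_change(before, after):
    """Per-label change in F (after minus before), rounded to 2 decimals."""
    return {
        label: round(after.get(label, 0.0) - score, 2)
        for label, score in before.items()
    }


changes = f_score_change(stage1_f, stage2_f)
# Print the labels from largest drop to smallest.
for label, delta in sorted(changes.items(), key=lambda kv: kv[1]):
    print(f"{label:<8}{delta:>8.2f}")
```

On these numbers every Stage 1 label loses F after Stage 2, with DATE showing the largest drop and ORG the smallest.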



In [98]:
import spacy
import re
activated = spacy.prefer_gpu()
nlp = spacy.load('../../data/models/spacy/md/2/model-best')

text = ['''SIDDHARTH RAGHUVANSHI                                Roll No. 06CS3025                                            DOB: 08/08/1988
Email: siddharth.iitkharagpur@gmail.com                                                                                          Mobile No.:   +91 9932584135
Degree/Certificate
Dual Degree[B. Tech (H) + M. Tech]
(Computer Science & Engineering)
Class XII:  C.B.S.E.
Class X:   C.B.S.E.
ACADEMIC ACHIEVEMENTS
Institute/ School, City
Indian Institute of Technology, Kharagpur
Central Hindu School, Varanasi
St. Atulanand Convent School, Varanasi
CGPA/ %  Completion
8.26/10
86.0%
90.8%
2011
2005
2003
Competitive
Examinations
  All India Rank 116 in AIEEE, 2006 among 470,000 students, State Rank 8 in Uttar Pradesh.
  All India Rank 119 in 7th National Science Olympiad, 2005.
  All India Rank 22 in All India Level Mathematics & Science Test organized by Central Institute  for
Proficiency in English Language (CIPEL).
Scholastic
Achievements
  National top 1% out of 26968 candidates appeared in National Standard Examination in Physics’05
  Receiving CBSE Merit Scholarship for the past 4 years.
ACADEMIC PROJECTS
M. Tech Project                                                                                      IIT Kharagpur                                           May’10-Nov’10
•
•
Studied the performance of text indexing algorithms on Hadoop MapReduce architecture.
Future work includes implementing more efficient indexing and retrieval techniques in MapReduce for distributed parallel
computing.
B. Tech Project                                                                                        IIT Kharagpur                                           Aug’09-May’10
  Developed a software with can handle all sorts of query related to geographical information extracted from maps.
  Developed a client interface which can fetch data from different incompatible geospatial web services and make that data
compatible for resolving queries.
Integrated my framework engine with different underlying heterogeneous spatial databases.

Static Instrumentation Of Java Programs                                          IIT Kharagpur                                                   May’08
  Developed a program using Byte Code Engineering Library to do automated testing of java program at byte code level.
WORK EXPERIENCE / INTERNSHIP
Extreme Blue Internship Program                                                                   ISL, IBM, Pune, India                                       May’09 – July’09
Business
Perspective
Technical
Perspective
 Achievements
  Conducted survey in Pune region on the current home delivery status of organized retails

Proposed and implemented a solution on how to increase home delivery sales in order to compete with the
localized general (kirana) stores

Built an independent Home Delivery module on Java EE platform using open standards such as XML and
Web Services
Integrated the Home Delivery module with IBM WebSphere Commerce.

  Received highest grade 10/10 in summer internship evaluation at IIT Kharagpur, 2009.
RELEVANT COURSES TAKEN
  Machine learning
  Algorithms-I
  Algorithms-II
Information Retrieval

  Distributed Systems

Probability and Statistics
POSITION OF RESPONSIBILITY

Student coordinator of IIT Kharagpur Student Counselling Service.
  Student member of team that conceptualized and publicized Counselling Centre in IIT Kharagpur after 5 successive suicides
in the campus within a span of 6 months in between Feb’09 and Jul’09.
  More than 100 students are counselled every month.
  No mishaps in the campus as of Sep’10 after the establishment of the centre.
  Went through Gate Keepers Training to identify behavioral change in a person.
  Managed  the  systems  team  of  Bitwise-2010,  an  international  algorithmic  intensive  programming  contest  leading  to  the
participation of 3000 teams across 75 countries.

Family Sub-head of accommodation team in Spring Fest, 2008.
  Head boy of my Senior Secondary School (Central Hindu School).
e
EXTRA CURRICULAR ACHIEVEMENTS
  Member of Silver winning team in inter hall OPENSOFT Competition in the session 2007-08.
  National Sports Organization: Among Top 30 students in Lawn Tennis Team at IIT Kharagpur’06. ''']

# normalize whitespace as per https://github.com/explosion/spaCy/discussions/10243
r = []
for t in text:
    r.append(re.sub(r"\s+", " ", t))

for doc in nlp.pipe(r):
    print([(ent.text, ent.label_) for ent in doc.ents])

[('SIDDHARTH RAGHUVANSHI Roll No. 06CS3025 DOB', 'ORG'), ('+91 9932584135 Degree/Certificate Dual Degree[B. Tech (H) + M. Tech] (', 'PHONE_NUMBER'), ('C.B.S.E. ACADEMIC ACHIEVEMENTS Institute/ School', 'ORG'), ('City Indian Institute of Technology', 'ORG'), ('Kharagpur Central Hindu School', 'ORG'), ('Varanasi St. Atulanand Convent School', 'GPE'), ('Varanasi CGPA/ % Completion 8.26/10 86.0% 90.8% 2011', 'ORG'), ('2005', 'DATE'), ('2003', 'STREET_ADDRESS'), ('Competitive Examinations', 'ORG'), ('116', 'ZIP_CODE'), ('AIEEE', 'GPE'), ('2006', 'DATE'), ('470,000', 'PHONE_NUMBER'), ('State Rank 8', 'STREET_ADDRESS'), ('Uttar Pradesh', 'GPE'), ('119', 'ZIP_CODE'), ('7th National Science Olympiad', 'GPE'), ('2005', 'DATE'), ('22', 'ZIP_CODE'), ('All India Level Mathematics & Science Test', 'STREET_ADDRESS'), ('Central Institute', 'ORG'), ('Proficiency in English Language (CIPEL)', 'TITLE'), ('26968', 'PHONE_NUMBER'), ('National Standard Examination', 'ORG'), ('IIT Kharagpur', 'NRP'), ('Hadoo