
### **1. Installing and Importing the spaCy Library**

In [1]:
!pip install spacy



In [3]:
import spacy

### **2. Loading the spaCy English Language Model & Processing Text from a News Article**

*The model "en_core_web_sm" is a small English pipeline provided by spaCy for tasks such as tokenization, POS tagging, dependency parsing, and named entity recognition. Before it can be loaded, it must be downloaded and installed with spaCy's download command:*

**`python -m spacy download en_core_web_sm`**

In [4]:
nlp = spacy.load("en_core_web_sm")

In [36]:
processed_text = nlp("Tech giants faced scrutiny as regulators announced a probe into potential antitrust violations, signaling a new era of scrutiny for big tech.")

In [10]:
for token in processed_text:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.is_alpha, token.is_stop)

Tech tech NOUN NN compound True False
giants giant NOUN NNS nsubj True False
faced face VERB VBD ROOT True False
scrutiny scrutiny NOUN NN dobj True False
as as SCONJ IN mark True True
regulators regulator NOUN NNS nsubj True False
announced announce VERB VBD advcl True False
a a DET DT det True True
probe probe NOUN NN dobj True False
into into ADP IN prep True True
potential potential ADJ JJ amod True False
antitrust antitrust ADJ JJ amod True False
violations violation NOUN NNS pobj True False
, , PUNCT , punct False False
signaling signal VERB VBG advcl True False
a a DET DT det True True
new new ADJ JJ amod True False
era era NOUN NN dobj True False
of of ADP IN prep True True
scrutiny scrutiny NOUN NN pobj True False
for for ADP IN prep True True
big big ADJ JJ amod True False
tech tech NOUN NN pobj True False
. . PUNCT . punct False False


### **3. Examining spaCy Output**

<font color='green'>1. Did all the words break apart in a way that you expected it to – we asked spacy to identify all the separate words in a sentence. Is that what it did? </font>

Yes, the code breaks the text into separate words/tokens and reports several linguistic attributes for each one, such as its part-of-speech tag and whether it is a stop word. Punctuation marks (the comma and the full stop) are split off as tokens of their own. This is what you would expect from using spaCy for tokenization.
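
One way to see what the tokenizer does beyond splitting on spaces is to compare its result with Python's built-in `str.split()`. The sketch below is plain Python and does not call spaCy; `spacy_tokens` is an illustrative name for the token texts copied by hand from the output above.

```python
# Plain-Python comparison; spacy_tokens is copied by hand from the output above.
text = ("Tech giants faced scrutiny as regulators announced a probe into "
        "potential antitrust violations, signaling a new era of scrutiny "
        "for big tech.")

# Naive tokenization: split on whitespace only.
whitespace_chunks = text.split()

spacy_tokens = [
    "Tech", "giants", "faced", "scrutiny", "as", "regulators", "announced",
    "a", "probe", "into", "potential", "antitrust", "violations", ",",
    "signaling", "a", "new", "era", "of", "scrutiny", "for", "big", "tech", ".",
]

# Chunks that whitespace splitting produces but that never appear as a spaCy token.
glued = [chunk for chunk in whitespace_chunks if chunk not in spacy_tokens]

print(len(whitespace_chunks), len(spacy_tokens))
print(glued)
```

Whitespace splitting yields 22 chunks and leaves punctuation glued to `violations,` and `tech.`, whereas the spaCy output lists 24 tokens with the comma and full stop separated.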

<font color='green'>2. Does there appear to be a common part of speech from your words? What does that imply? </font>

The most common part-of-speech tag in the output is NOUN. Nouns typically represent people, places, things, or ideas, and they often serve as the subject or object of a sentence. That nouns dominate suggests the sentence is dense with information: it mostly names the actors and things involved (giants, regulators, probe, violations) rather than describing actions.

In [43]:
from collections import Counter
pos_tags = [token.pos_ for token in processed_text]
pos_frequency = Counter(pos_tags)
pos_frequency_sorted = pos_frequency.most_common()
print(pos_frequency_sorted)

[('NOUN', 9), ('ADJ', 4), ('VERB', 3), ('ADP', 3), ('DET', 2), ('PUNCT', 2), ('SCONJ', 1)]


So NOUN is the most frequent tag in the processed text (9 occurrences), followed by ADJ (4).
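
To put the counts in proportion, the sketch below converts them into shares of all tokens. It uses only the numbers printed above; `pos_counts` and `pos_share` are illustrative names, not part of spaCy.

```python
# Counts copied from the Counter output above.
pos_counts = {"NOUN": 9, "ADJ": 4, "VERB": 3, "ADP": 3,
              "DET": 2, "PUNCT": 2, "SCONJ": 1}

total_tokens = sum(pos_counts.values())

# Share of each part-of-speech tag as a percentage of all tokens.
pos_share = {tag: 100 * count / total_tokens for tag, count in pos_counts.items()}

for tag, share in pos_share.items():
    print(f"{tag:<6} {share:5.1f}%")
```

Nouns account for 9 of the 24 tokens (37.5%), more than twice the share of any other tag.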

<font color='green'>3. Examining the last column (is_stop) what kinds of words do you expect stop words to be? Do you think they will be useful for text analysis? </font>

Stop words are the small, high-frequency function words in a sentence, such as "the," "and," "is," and "of." They are needed to make a sentence grammatical, but they carry little meaning on their own. In the last column (is_stop), the words flagged True are "as," "a," "into," "of," and "for." For most text analysis they add little, because they say almost nothing about the topic or sentiment of a text.

In [42]:
stop_words = []

for token in processed_text:
    if token.is_stop:
        stop_words.append(token.text)

print(stop_words)

['as', 'a', 'into', 'a', 'of', 'for']


The processed text contains six stop-word tokens: "as", "a" (twice), "into", "of", and "for". In tasks such as identifying the topic of a text or the sentiment it expresses, it is therefore common practice to filter these stop words out.
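
As a rough illustration of that filtering step, the sketch below keeps only the alphabetic, non-stop-word tokens. To stay independent of any installed model, it works on `(text, is_alpha, is_stop)` rows copied from the token table printed earlier; with a live `Doc`, the same condition could be applied to `token.is_alpha` and `token.is_stop` directly. The names `rows` and `content_words` are illustrative.

```python
# Rows copied from the token table above: (text, is_alpha, is_stop).
rows = [
    ("Tech", True, False), ("giants", True, False), ("faced", True, False),
    ("scrutiny", True, False), ("as", True, True), ("regulators", True, False),
    ("announced", True, False), ("a", True, True), ("probe", True, False),
    ("into", True, True), ("potential", True, False), ("antitrust", True, False),
    ("violations", True, False), (",", False, False), ("signaling", True, False),
    ("a", True, True), ("new", True, False), ("era", True, False),
    ("of", True, True), ("scrutiny", True, False), ("for", True, True),
    ("big", True, False), ("tech", True, False), (".", False, False),
]

# Keep tokens that are alphabetic (drops punctuation) and are not stop words.
content_words = [text for text, is_alpha, is_stop in rows if is_alpha and not is_stop]

print(content_words)
```

Of the 24 tokens, 16 remain after the six stop words and two punctuation marks are dropped, and those 16 still carry the gist of the sentence.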

### **4. Examining spaCy Output 2**

In [17]:
processed_text = nlp("Tech giants faced scrutiny as regulators announced a probe into potential antitrust violations, signaling a new era of scrutiny for big tech.")
for ent in processed_text.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

spaCy recognizes no named entities in the text above, which is why the loop prints nothing. Named entities are specific real-world items such as people, organizations, locations, and dates, recognized by the NER (Named Entity Recognition) component of the spaCy model. Our sentence contains none of these; "tech giants" and "regulators" are generic descriptions rather than names.

*So, let's use a different text that contains named entities:*

In [20]:
processed_text = nlp("Elon Musk's SpaceX successfully launched a new satellite into orbit from their facility in Cape Canaveral, Florida.")
for ent in processed_text.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Elon Musk's 0 11 PERSON
Cape Canaveral 91 105 GPE
Florida 107 114 GPE


<font color='green'>1. Given the output, what do you think named entities are? </font>

As mentioned above, named entities are specific real-world items such as people, organizations, locations, and dates, recognized by the NER (Named Entity Recognition) component of the spaCy model. In the text above, the recognized named entities are a person ("Elon Musk's") and two locations ("Cape Canaveral" and "Florida").
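
The two numbers printed for each entity are `start_char` and `end_char`, which are character offsets into the original string. A quick way to check this, sketched below in plain Python with the offsets copied from the output above, is to slice the text with them; `entity_spans` is an illustrative name.

```python
text = ("Elon Musk's SpaceX successfully launched a new satellite into orbit "
        "from their facility in Cape Canaveral, Florida.")

# (start_char, end_char, label) triples copied from the output above.
entity_spans = [(0, 11, "PERSON"), (91, 105, "GPE"), (107, 114, "GPE")]

for start, end, label in entity_spans:
    # end_char is exclusive, so an ordinary Python slice recovers the entity text.
    print(repr(text[start:end]), label)
```

The slices reproduce the printed entity texts exactly. Note that in this run the PERSON span includes the possessive `'s`, and `SpaceX` does not appear among the recognized entities.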

<font color='green'>2. How does the output differ from the text provided by printing the parts of speech? </font>

Named entity recognition picks out specific real-world things such as people or places (here "Elon Musk's" and "Cape Canaveral"), while part-of-speech tagging describes the grammatical role of each word, such as whether it is a noun or a verb, as we saw in the earlier "Tech giants" example. Named entities cover only certain spans of the text, and a single entity can span several words, whereas part-of-speech tags cover every token. Both are useful for understanding text, but they focus on different aspects: named entities serve tasks such as entity extraction, and parts of speech are essential for syntactic parsing and grammatical analysis.

In short, named entity labels (e.g., PERSON, ORG, GPE) assign entity spans to broad semantic categories, while part-of-speech tags (e.g., NOUN, VERB, ADJ) classify each word by its grammatical role.

<font color='green'>3. What might you guess spacy tends to label in named entity recognition? </font>

spaCy's named entity recognition (NER) labels spans that refer to specific real-world items or quantities. Some common labels used by spaCy's English models include:

*   PERSON: Names of people or characters.
*   ORG: Names of companies, institutions, or groups.
*   GPE (Geopolitical Entity): Names of countries, cities, states, or other geopolitical entities.
*   DATE: Absolute or relative dates or periods.
*   TIME: Times shorter than a day.
*   PRODUCT: Names of products or items.
*   EVENT: Names of events or occurrences.
*   MONEY: Monetary values, including the currency unit.
*   QUANTITY: Measurements, such as weight or distance.
*   FAC: Names of buildings, airports, or other facilities.


In the text above, the output contains **one PERSON entity ("Elon Musk's") and two GPE entities ("Cape Canaveral" and "Florida")**. The exact label set depends on the language and version of the spaCy model you're using, but the goal is the same: to find and label the real-world names and quantities mentioned in the text.
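
When a text yields more than a handful of entities, grouping them by label gives a quick summary. The sketch below does this with `collections.defaultdict` on `(text, label)` pairs copied from the output above; with a live `Doc`, the pairs could instead be built from `ent.text` and `ent.label_` for each `ent` in `processed_text.ents`. The names `entities` and `by_label` are illustrative.

```python
from collections import defaultdict

# (entity text, label) pairs copied from the NER output above.
entities = [("Elon Musk's", "PERSON"), ("Cape Canaveral", "GPE"), ("Florida", "GPE")]

# Map each label to the list of entity texts that received it.
by_label = defaultdict(list)
for ent_text, label in entities:
    by_label[label].append(ent_text)

for label, texts in by_label.items():
    print(label, texts)
```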