# Computing with Language: Simple Statistics

# Frequency Distributions

In Natural Language Processing (NLP), a frequency distribution is a tally of how many times each distinct token (a word or a punctuation mark) occurs in a given text. It can be used to judge which words carry more or less weight, either in a particular context or in general, and to get a rough idea of the topic the text discusses.

NLTK makes it easier to build frequency distributions and to compute the statistics derived from them.
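To make the idea concrete on a small scale, here is a minimal sketch that tallies a made-up sentence with Python's standard `collections.Counter`. NLTK's `FreqDist` is a subclass of `Counter`, so the same methods carry over. The sample sentence and the variable names are illustrative assumptions, not part of the original notebook.

```python
from collections import Counter

# Hypothetical sample sentence, used only for illustration
sample = "the whale saw the ship and the ship saw the whale"

# Whitespace tokenisation, then one tally per distinct token
tokens = sample.split()
freq = Counter(tokens)

print(freq["the"])          # count for a single word
print(freq.most_common(1))  # the most frequent token with its count
print(len(freq))            # number of distinct tokens
```

Indexing gives the count for one token, `most_common(n)` gives the top `n` entries, and `len` gives the number of distinct tokens.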

IMPLEMENTATION: 

In [1]:
from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


In [3]:
# Access the first 50 tokens of text1 (Moby Dick); punctuation marks count as tokens
ex = text1[:50]
ex

['[',
 'Moby',
 'Dick',
 'by',
 'Herman',
 'Melville',
 '1851',
 ']',
 'ETYMOLOGY',
 '.',
 '(',
 'Supplied',
 'by',
 'a',
 'Late',
 'Consumptive',
 'Usher',
 'to',
 'a',
 'Grammar',
 'School',
 ')',
 'The',
 'pale',
 'Usher',
 '--',
 'threadbare',
 'in',
 'coat',
 ',',
 'heart',
 ',',
 'body',
 ',',
 'and',
 'brain',
 ';',
 'I',
 'see',
 'him',
 'now',
 '.',
 'He',
 'was',
 'ever',
 'dusting',
 'his',
 'old',
 'lexicons',
 'and']

In [4]:
# Build a frequency distribution over every token in text1
fdist = FreqDist(text1)

In [6]:
# Number of distinct tokens (case-sensitive, punctuation included)

count = len(fdist)
count

19317

In [8]:
# The 50 most common tokens and their frequencies

fdist.most_common(50)

[(',', 18713),
 ('the', 13721),
 ('.', 6862),
 ('of', 6536),
 ('and', 6024),
 ('a', 4569),
 ('to', 4542),
 (';', 4072),
 ('in', 3916),
 ('that', 2982),
 ("'", 2684),
 ('-', 2552),
 ('his', 2459),
 ('it', 2209),
 ('I', 2124),
 ('s', 1739),
 ('is', 1695),
 ('he', 1661),
 ('with', 1659),
 ('was', 1632),
 ('as', 1620),
 ('"', 1478),
 ('all', 1462),
 ('for', 1414),
 ('this', 1280),
 ('!', 1269),
 ('at', 1231),
 ('by', 1137),
 ('but', 1113),
 ('not', 1103),
 ('--', 1070),
 ('him', 1058),
 ('from', 1052),
 ('be', 1030),
 ('on', 1005),
 ('so', 918),
 ('whale', 906),
 ('one', 889),
 ('you', 841),
 ('had', 767),
 ('have', 760),
 ('there', 715),
 ('But', 705),
 ('or', 697),
 ('were', 680),
 ('now', 646),
 ('which', 640),
 ('?', 637),
 ('me', 627),
 ('like', 624)]
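The top of this list is dominated by punctuation and function words, which say little about the topic. One rough way to surface more informative words is to keep only alphabetic tokens, lowercase them, and drop a stop list before counting. The sketch below assumes a deliberately tiny stop list; `STOPWORDS` and `content_word_counts` are hypothetical names chosen for illustration, and a real analysis would likely use a fuller list.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop list (an assumption for this sketch)
STOPWORDS = {"the", "of", "and", "a", "to", "in", "that", "his", "it", "i", "is", "he", "was"}

def content_word_counts(tokens, stopwords=STOPWORDS):
    """Count lowercased alphabetic tokens that are not in the stop list."""
    kept = (w.lower() for w in tokens if w.isalpha())
    return Counter(w for w in kept if w not in stopwords)

# Illustrative tokens in the style of text1 (not the real corpus)
tokens = ["The", "whale", ",", "the", "White", "Whale", "!", "--", "and", "Ahab", "saw", "it", "."]
counts = content_word_counts(tokens)
print(counts.most_common(1))
```

With the NLTK book loaded, the same helper could be called on `text1` directly, since an NLTK text can be iterated token by token.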

# Fine-grained Selection of Words

Fine-grained selection of words matters because NLP work, whether with the Natural Language Toolkit (NLTK) or with other libraries, usually involves very large texts (corpora, web-based text, and so on), while we are typically interested in only a small piece of the text or in a specific feature, property, or dimension of it. We therefore need efficient and accurate ways to pick out fine-grained portions (usually words) of a large text.

Set theory often helps here. Converting a list of tokens to a set removes duplicates, so when a large input text repeats many words many times, later steps have far fewer items to process and take less time.
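As a small sketch of what set operations offer beyond de-duplication, the snippet below compares the vocabularies of two made-up token lists using intersection, difference, and union. The two sample strings and all variable names are assumptions introduced for illustration.

```python
# Two hypothetical token lists, used only for illustration
doc_a = "AI systems learn from data and AI systems adapt".split()
doc_b = "humans learn from experience and humans adapt".split()

vocab_a = set(doc_a)   # de-duplicated tokens of doc_a
vocab_b = set(doc_b)   # de-duplicated tokens of doc_b

shared = vocab_a & vocab_b      # words used in both texts
only_in_a = vocab_a - vocab_b   # words that appear only in doc_a
combined = vocab_a | vocab_b    # vocabulary of both texts together

print(len(doc_a), len(vocab_a))  # token count vs. distinct-token count
print(sorted(shared))
print(sorted(only_in_a))
print("data" in vocab_a)         # membership test on a set is a hash lookup
```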

EXAMPLE OF SIGNIFICANCE OF SET OPERATIONS: 

In [10]:
#Defining the text to be considered

original_text="Artificial Intelligence (AI) By JAKE FRANKENFIELD Updated March 08, 2021 Reviewed by GORDON SCOTT What Is Artificial Intelligence (AI)? Artificial intelligence (AI) refers to the simulation of human intelligence in machines that are programmed to think like humans and mimic their actions. The term may also be applied to any machine that exhibits traits associated with a human mind such as learning and problem-solving. The ideal characteristic of artificial intelligence is its ability to rationalize and take actions that have the best chance of achieving a specific goal. A subset of artificial intelligence is machine learning, which refers to the concept that computer programs can automatically learn from and adapt to new data without being assisted by humans. Deep learning techniques enable this automatic learning through the absorption of huge amounts of unstructured data such as text, images, or video. KEY TAKEAWAYS Artificial intelligence refers to the simulation of human intelligence in machines. The goals of artificial intelligence include learning, reasoning, and perception. AI is being used across different industries including finance and healthcare. Weak AI tends to be simple and single-task oriented, while strong AI carries on tasks that are more complex and human-like. What if you had started investing years ago? Find out what a hypothetical investment would be worth today. SELECT A STOCK TSLA TESLA INC AAPL APPLE INC NKE NIKE INC AMZN AMAZON.COM, INC WMT WALMART INC SELECT INVESTMENT AMOUNT $ 1,000 SELECT A PURCHASE DATE 5 years ago CALCULATE Understanding Artificial Intelligence (AI) When most people hear the term artificial intelligence, the first thing they usually think of is robots. That's because big-budget films and novels weave stories about human-like machines that wreak havoc on Earth. But nothing could be further from the truth. 
Artificial intelligence is based on the principle that human intelligence can be defined in a way that a machine can easily mimic it and execute tasks, from the most simple to those that are even more complex. The goals of artificial intelligence include mimicking human cognitive activity. Researchers and developers in the field are making surprisingly rapid strides in mimicking activities such as learning, reasoning, and perception, to the extent that these can be concretely defined. Some believe that innovators may soon be able to develop systems that exceed the capacity of humans to learn or reason out any subject. But others remain skeptical because all cognitive activity is laced with value judgments that are subject to human experience. As technology advances, previous benchmarks that defined artificial intelligence become outdated. For example, machines that calculate basic functions or recognize text through optical character recognition are no longer considered to embody artificial intelligence, since this function is now taken for granted as an inherent computer function. AI is continuously evolving to benefit many different industries. Machines are wired using a cross-disciplinary approach based on mathematics, computer science, linguistics, psychology, and more. Algorithms often play a very important part in the structure of artificial intelligence, where simple algorithms are used in simple applications, while more complex ones help frame strong artificial intelligence. Applications of Artificial Intelligence The applications for artificial intelligence are endless. The technology can be applied to many different sectors and industries. AI is being tested and used in the healthcare industry for dosing drugs and different treatment in patients, and for surgical procedures in the operating room. Other examples of machines with artificial intelligence include computers that play chess and self-driving cars. 
Each of these machines must weigh the consequences of any action they take, as each action will impact the end result. In chess, the end result is winning the game. For self-driving cars, the computer system must account for all external data and compute it to act in a way that prevents a collision. Artificial intelligence also has applications in the financial industry, where it is used to detect and flag activity in banking and finance such as unusual debit card usage and large account deposits—all of which help a bank's fraud department. Applications for AI are also being used to help streamline and make trading easier. This is done by making supply, demand, and pricing of securities easier to estimate. Categorization of Artificial Intelligence Artificial intelligence can be divided into two different categories: weak and strong. Weak artificial intelligence embodies a system designed to carry out one particular job. Weak AI systems include video games such as the chess example from above and personal assistants such as Amazon's Alexa and Apple's Siri. You ask the assistant a question, it answers it for you. Strong artificial intelligence systems are systems that carry on the tasks considered to be human-like. These tend to be more complex and complicated systems. They are programmed to handle situations in which they may be required to problem solve without having a person intervene. These kinds of systems can be found in applications like self-driving cars or in hospital operating rooms."

In [18]:
#Split the given string text into individual word- tokens
text= original_text.split()

#print(text)
print("The number of words in list version of the text split into words: "+str(len(text)))

The number of words in list version of the text split into words: 817


In [21]:
#Get the set version of the text
set_text= set(text)

#print(set_text)
print("The number of words in set version of the text split into words (i.e, size of VOCABULARY): "+str(len(set_text)))

The number of words in set version of the text split into words (i.e, size of VOCABULARY): 421


Note: The set holds 421 distinct tokens against 817 tokens in the list, so any operation that needs each word only once has roughly half as many items to process.
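Because `split()` breaks only on whitespace, punctuation stays attached to the tokens and case is preserved, so `intelligence`, `intelligence,` and `Intelligence` all count as separate vocabulary entries. A minimal sketch of one way to tighten the vocabulary is to strip punctuation from both ends of each token and lowercase it before building the set. The `normalize` helper and the sample fragment are hypothetical, introduced only for this illustration.

```python
import string

def normalize(token):
    """Lowercase a token and strip ASCII punctuation from both ends."""
    return token.strip(string.punctuation).lower()

# Illustrative fragment, not the full original_text
sample = "Artificial intelligence, or AI, is artificial. Intelligence (AI) is human-like."
raw_vocab = set(sample.split())

# Normalise first, drop tokens that were pure punctuation, then de-duplicate
clean_vocab = {normalize(t) for t in sample.split()} - {""}

print(len(raw_vocab), len(clean_vocab))
print(sorted(clean_vocab))
```

Hyphens inside a word such as `human-like` are kept, since only the ends of each token are stripped.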

The following operations select words from the vocabulary at a finer grain:

In [25]:
# V holds the vocabulary (the set of distinct tokens)
V = set_text

# Keep only the words longer than 8 characters
long_words = [w for w in V if len(w) > 8]

print("Long words of lengths more than 8  in the vocabulary are: ")
print()
print(long_words)

Long words of lengths more than 8  in the vocabulary are: 

['complicated', 'activities', 'linguistics,', 'intelligence', 'reasoning,', 'Understanding', 'associated', 'intelligence.', 'situations', 'cognitive', 'human-like.', 'judgments', 'cross-disciplinary', 'department.', 'achieving', 'mimicking', 'treatment', 'estimate.', 'INVESTMENT', 'applications,', 'concretely', 'assistants', 'securities', 'different', 'recognition', 'industries', 'technology', 'problem-solving.', 'mathematics,', 'FRANKENFIELD', 'outdated.', 'collision.', 'streamline', 'unstructured', 'big-budget', 'algorithms', 'continuously', 'perception.', 'developers', 'automatic', 'activity.', 'functions', 'Categorization', 'AMAZON.COM,', 'Algorithms', 'healthcare', 'structure', 'Intelligence', 'character', 'Artificial', 'calculate', 'Applications', 'principle', 'oriented,', 'intelligence,', 'self-driving', 'hypothetical', 'function.', 'including', 'consequences', 'intervene.', 'artificial', 'psychology,', 'industries.', '

In [26]:
# Sort long_words alphabetically (uppercase letters sort before lowercase ones)

print("Sorted words of lengths more than 8  in the vocabulary are: ")
print()
print(sorted(long_words))

Sorted words of lengths more than 8  in the vocabulary are: 

['AMAZON.COM,', 'Algorithms', 'Applications', 'Artificial', 'CALCULATE', 'Categorization', 'FRANKENFIELD', 'INVESTMENT', 'Intelligence', 'Researchers', 'TAKEAWAYS', 'Understanding', 'absorption', 'achieving', 'activities', 'activity.', 'advances,', 'algorithms', 'applications', 'applications,', 'artificial', 'assistant', 'assistants', 'associated', 'automatic', 'automatically', 'benchmarks', 'big-budget', 'calculate', 'categories:', 'character', 'characteristic', 'cognitive', 'collision.', 'complicated', 'computers', 'concretely', 'consequences', 'considered', 'continuously', 'cross-disciplinary', 'department.', 'deposits—all', 'developers', 'different', 'estimate.', 'experience.', 'financial', 'function.', 'functions', 'healthcare', 'healthcare.', 'human-like', 'human-like.', 'hypothetical', 'important', 'including', 'industries', 'industries.', 'industry,', 'innovators', 'intelligence', 'intelligence,', 'intelligence.', 'i

In [28]:
# Lowercase every word. Note that lowercasing after sorting leaves the list out of
# alphabetical order and can introduce duplicates (e.g. 'algorithms' appears twice)

lower_case_sorted_long_words= [w.lower() for w in sorted(long_words)]

print("All- lowercase versions of sorted words of lengths more than 8  in the vocabulary are: ")
print()
print(lower_case_sorted_long_words)

All- lowercase versions of sorted words of lengths more than 8  in the vocabulary are: 

['amazon.com,', 'algorithms', 'applications', 'artificial', 'calculate', 'categorization', 'frankenfield', 'investment', 'intelligence', 'researchers', 'takeaways', 'understanding', 'absorption', 'achieving', 'activities', 'activity.', 'advances,', 'algorithms', 'applications', 'applications,', 'artificial', 'assistant', 'assistants', 'associated', 'automatic', 'automatically', 'benchmarks', 'big-budget', 'calculate', 'categories:', 'character', 'characteristic', 'cognitive', 'collision.', 'complicated', 'computers', 'concretely', 'consequences', 'considered', 'continuously', 'cross-disciplinary', 'department.', 'deposits—all', 'developers', 'different', 'estimate.', 'experience.', 'financial', 'function.', 'functions', 'healthcare', 'healthcare.', 'human-like', 'human-like.', 'hypothetical', 'important', 'including', 'industries', 'industries.', 'industry,', 'innovators', 'intelligence', 'intellig
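Selecting on length alone tends to return many words that occur only once. A common refinement is to combine the length condition with a frequency condition, so that only words that are both long and recurrent are kept. The sketch below uses `collections.Counter` as a stand-in for `FreqDist`; the sample tokens and both thresholds are arbitrary assumptions chosen to suit a tiny example.

```python
from collections import Counter

# Illustrative tokens (assumed already lowercased and stripped of punctuation)
tokens = ("artificial intelligence refers to the simulation of human intelligence "
          "in machines and artificial intelligence includes learning").split()

fdist = Counter(tokens)   # stands in for nltk.FreqDist in this sketch
V = set(tokens)

# Thresholds are arbitrary assumptions chosen for this small sample
LENGTH_ABOVE = 8
COUNT_AT_LEAST = 2

# Keep words that are both long and frequent, sorted alphabetically
selected = sorted(w for w in V if len(w) > LENGTH_ABOVE and fdist[w] >= COUNT_AT_LEAST)
print(selected)
```

Sorting is the last step here, after any normalisation, so the result stays in alphabetical order and is free of duplicates.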