# A. Import Data

In [119]:
import re
from collections import Counter

In [103]:
data = """As a term, data analytics predominantly refers to an assortment of applications, from basic business
intelligence (BI), reporting and online analytical processing (OLAP) to various forms of advanced
analytics. In that sense, it's similar in nature to business analytics, another umbrella term for
approaches to analyzing data -- with the difference that the latter is oriented to business uses, while
data analytics has a broader focus. The expansive view of the term isn't universal, though: In some
cases, people use data analytics specifically to mean advanced analytics, treating BI as a separate
category. Data analytics initiatives can help businesses increase revenues, improve operational
efficiency, optimize marketing campaigns and customer service efforts, respond more quickly to
emerging market trends and gain a competitive edge over rivals -- all with the ultimate goal of
boosting business performance. Depending on the particular application, the data that's analyzed
can consist of either historical records or new information that has been processed for real-time
analytics uses. In addition, it can come from a mix of internal systems and external data sources. At
a high level, data analytics methodologies include exploratory data analysis (EDA), which aims to find
patterns and relationships in data, and confirmatory data analysis (CDA), which applies statistical
techniques to determine whether hypotheses about a data set are true or false. EDA is often
compared to detective work, while CDA is akin to the work of a judge or jury during a court trial -- a
distinction first drawn by statistician John W. Tukey in his 1977 book Exploratory Data Analysis. Data
analytics can also be separated into quantitative data analysis and qualitative data analysis. The
former involves analysis of numerical data with quantifiable variables that can be compared or
measured statistically. The qualitative approach is more interpretive -- it focuses on understanding
the content of non-numerical data like text, images, audio and video, including common phrases,
themes and points of view."""

# B. Text Analysis

## B.1 What is the probability of the word “data” occurring in each line?
number of lines containing "data" / number of lines = 0.636

In [104]:
# Lowercase the text and split it into lines
data_split_lines = data.lower().splitlines()

In [105]:
# Count the lines that contain "data" (substring test, so a word
# such as "database" would also count as a hit)
word_data_occurrences = 0
for line in data_split_lines:
    if 'data' in line:
        word_data_occurrences += 1

In [106]:
prob_word_data_occurring_in_each_line = word_data_occurrences / len(data_split_lines)
prob_word_data_occurring_in_each_line

0.6363636363636364
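
Because `'data' in line` is a substring test, a stricter variant would match "data" only as a whole word. The following is a minimal sketch of that idea, not the notebook's own method; the helper name `line_probability` and the `sample` text are hypothetical and exist only for illustration.

```python
import re

def line_probability(word, text):
    """Fraction of lines in `text` that contain `word` as a whole word."""
    # \b anchors the match at word boundaries; IGNORECASE replaces .lower()
    pattern = re.compile(r"\b%s\b" % re.escape(word), re.IGNORECASE)
    lines = text.splitlines()
    if not lines:
        return 0.0
    hits = sum(1 for line in lines if pattern.search(line))
    return hits / len(lines)

# Hypothetical sample: "database" must not count as a hit for "data"
sample = "Data is useful.\nA database stores records.\nNo match here.\nMore data here."
print(line_probability("data", sample))  # 2 of 4 lines match -> 0.5
```

Calling `line_probability("data", data)` on the passage above would apply the same whole-word rule to the real text.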

# C. Create Unigram Word Model

In [107]:
def tokenize(string):
    # Raw string avoids the invalid-escape warning for '\w'
    return re.findall(r'\w+', string)

def word_freq(string):
    text = tokenize(string.lower())
    c = Counter(text)  # count each distinct token
    return dict(c)

In [108]:
words = word_freq(data) # count and get dicts with counts

In [109]:
sum_words = sum(words.values())  # total number of word tokens
sum_words

320
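
A unigram model assigns each word its relative frequency, that is, its count divided by the total number of tokens. The sketch below shows one way to turn the counts into probabilities; the function name `unigram_probabilities` and the short input string are assumptions made for illustration, and the tokenizer mirrors the `\w+` pattern used above (which splits contractions such as "it's" into two tokens).

```python
import re
from collections import Counter

def unigram_probabilities(text):
    """Map each token to count / total tokens (same \\w+ tokenizer as above)."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    total = sum(counts.values())
    # Empty input yields an empty dict, so no division by zero occurs
    return {word: count / total for word, count in counts.items()}

# Hypothetical four-token input: "data" appears twice
probs = unigram_probabilities("data analytics uses data")
print(probs["data"])  # 2 of 4 tokens -> 0.5
```

The probabilities sum to 1 by construction, which is a quick sanity check when running this on the full passage.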

## C.1 What is the distribution of distinct word counts across all the lines?

In [116]:
words

{'1977': 1,
 'a': 10,
 'about': 1,
 'addition': 1,
 'advanced': 2,
 'aims': 1,
 'akin': 1,
 'all': 1,
 'also': 1,
 'an': 1,
 'analysis': 6,
 'analytical': 1,
 'analytics': 10,
 'analyzed': 1,
 'analyzing': 1,
 'and': 9,
 'another': 1,
 'application': 1,
 'applications': 1,
 'applies': 1,
 'approach': 1,
 'approaches': 1,
 'are': 1,
 'as': 2,
 'assortment': 1,
 'at': 1,
 'audio': 1,
 'basic': 1,
 'be': 2,
 'been': 1,
 'bi': 2,
 'book': 1,
 'boosting': 1,
 'broader': 1,
 'business': 4,
 'businesses': 1,
 'by': 1,
 'campaigns': 1,
 'can': 5,
 'cases': 1,
 'category': 1,
 'cda': 2,
 'come': 1,
 'common': 1,
 'compared': 2,
 'competitive': 1,
 'confirmatory': 1,
 'consist': 1,
 'content': 1,
 'court': 1,
 'customer': 1,
 'data': 18,
 'depending': 1,
 'detective': 1,
 'determine': 1,
 'difference': 1,
 'distinction': 1,
 'drawn': 1,
 'during': 1,
 'eda': 2,
 'edge': 1,
 'efficiency': 1,
 'efforts': 1,
 'either': 1,
 'emerging': 1,
 'expansive': 1,
 'exploratory': 2,
 'external': 1,
 'false':
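
The dictionary above lists a count per word. One possible reading of "distribution of distinct word counts" is a count of counts: how many distinct words occur once, twice, and so on. The sketch below illustrates that reading under this assumption; `count_distribution` and its toy input are hypothetical names used only here.

```python
import re
from collections import Counter

def count_distribution(text):
    """Return {frequency: number of distinct words with that frequency}."""
    word_counts = Counter(re.findall(r"\w+", text.lower()))
    return Counter(word_counts.values())

# Toy input: "a" and "d" occur once, "b" twice, "c" three times
dist = count_distribution("a b b c c c d")
for freq, n_words in sorted(dist.items()):
    print(freq, n_words)
# 1 2
# 2 1
# 3 1
```

For a ranked view of the words themselves, `Counter(words).most_common(10)` would list the ten most frequent tokens.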

## C.2 What is the probability of the word “analytics” occurring after the word “data”?
Prob("analytics"|"data") = Prob("data analytics") / Prob("data") = 0.278
(the total word count cancels, so this equals count("data analytics") / count("data"))

In [114]:
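def get_multi_word_occurrence(word, input_string):
    # Count case-insensitive whole-phrase matches; the words must be
    # separated exactly as written in `word` (a single space here)
    return sum(1 for _ in re.finditer(r'\b%s\b' % re.escape(word), input_string.lower()))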

In [113]:
prob_data_analytics = get_multi_word_occurrence("data analytics", data) / sum_words
prob_data = words["data"] / sum_words
prob_analytics_given_data = prob_data_analytics / prob_data

prob_analytics_given_data

0.2777777777777778
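
The regex above matches "data analytics" only when the two words are separated by a single space, so a pair split across a line break is not counted; this appears to happen once in the passage ("Data" ends one line and "analytics" starts the next). The sketch below counts adjacent token pairs instead, which ignores the kind of whitespace between them. It is an alternative offered under stated assumptions, not the notebook's method: `conditional_probability` and `sample` are hypothetical, and because punctuation is dropped by the tokenizer, a pair such as "data. Analytics" would also count.

```python
import re

def conditional_probability(first, second, text):
    """Estimate P(second | first) as count(first second) / count(first)."""
    tokens = re.findall(r"\w+", text.lower())
    first_count = tokens.count(first)
    if first_count == 0:
        return 0.0
    # Adjacent token pairs; any whitespace (including newlines) is ignored
    pair_count = sum(
        1 for a, b in zip(tokens, tokens[1:]) if a == first and b == second
    )
    return pair_count / first_count

# Hypothetical sample: one pair on a single line, one split by a newline
sample = "Data analytics is broad. Big data\nanalytics needs data."
print(conditional_probability("data", "analytics", sample))  # 2 of 3 -> 0.666...
```

Running it on the full passage may therefore give a slightly different value from the cell above, since the line-split pair would be included.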