<h1 style="text-align: center;">Financial Accounting - HW7</h1> 

<p style="text-align: right;"><i>Anirudh Narayanan</i></p>

In [1]:
import pandas as pd
import numpy as np
import pylab as pl
import re

## Q1 - Benford's Law

#### Generating the samples 

In [2]:
x = np.random.uniform(0,1, 1000)
y = np.random.normal(1,1,1000)
z = np.random.normal(0,1,1000)

### A) Testing Benford's law on each of x, y and z

In [3]:
#Getting lists of first digits and the % occurrence of each digit

x_1stdigit = pd.Series([[digit for digit in list(num) if digit not in ('0','.','-')][0]  for num in x.astype(str)]).astype(int)
x_digcounts = pd.DataFrame( x_1stdigit.value_counts().sort_index()/1000,columns = ['x'])

y_1stdigit = pd.Series([[digit for digit in list(num) if digit not in ('0','.','-')][0]  for num in y.astype(str)]).astype(int)
y_digcounts = pd.DataFrame( y_1stdigit.value_counts().sort_index()/1000,columns = ['y'])

z_1stdigit = pd.Series([[digit for digit in list(num) if digit not in ('0','.','-')][0]  for num in z.astype(str)]).astype(int)
z_digcounts = pd.DataFrame( z_1stdigit.value_counts().sort_index()/1000,columns = ['z'])

<b><i>Calculating K-S Statistic</i></b>
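
For reference, the quantities computed in the following cells can be written out as below. The notation ($\hat{p}_d$ for the observed share of first digit $d$, $n$ for the sample size) is introduced here for convenience and is not part of the assignment text.

$$
P(d) = \log_{10}\left(1 + \frac{1}{d}\right), \qquad d = 1, \dots, 9
$$

$$
D = \max_{1 \le k \le 9} \left| \sum_{d=1}^{k} \hat{p}_d - \sum_{d=1}^{k} P(d) \right|, \qquad D_{\text{cutoff}} = \frac{1.36}{\sqrt{n}}
$$

The constant 1.36 is the usual asymptotic 5% critical value of the Kolmogorov-Smirnov test. A sample is judged inconsistent with Benford's Law when $D$ exceeds $D_{\text{cutoff}}$.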

In [4]:
#Creating a dataframe of expected distributions and adding the above dataframes
digit_distributions = pd.DataFrame([np.log10(1+(1/float(d))) for d in range(1,10)]
                                   ,index = range(1,10), columns = ['expected'])

digit_distributions = pd.concat([digit_distributions,x_digcounts, y_digcounts, z_digcounts], axis = 1 )

digit_distributions.head()

| digit | expected | x | y | z |
|---|---|---|---|---|
| 1 | 0.30103 | 0.104 | 0.418 | 0.355 |
| 2 | 0.176091 | 0.109 | 0.202 | 0.143 |
| 3 | 0.124939 | 0.113 | 0.072 | 0.086 |
| 4 | 0.09691 | 0.138 | 0.064 | 0.079 |
| 5 | 0.079181 | 0.106 | 0.048 | 0.065 |


In [5]:
#Cumulative sums and differences from expected values
digit_dist_cumsum = digit_distributions.cumsum()

digit_dist_cumsum['diff_x'] = (digit_dist_cumsum.x - digit_dist_cumsum.expected).abs()
digit_dist_cumsum['diff_y'] = (digit_dist_cumsum.y - digit_dist_cumsum.expected).abs()
digit_dist_cumsum['diff_z'] = (digit_dist_cumsum.z - digit_dist_cumsum.expected).abs()

#K-S Statistic for each of x, y and z
ks_x = digit_dist_cumsum.diff_x.max()
ks_y = digit_dist_cumsum.diff_y.max()
ks_z = digit_dist_cumsum.diff_z.max()

#Cut-off value
cutoff = 1.36/np.sqrt(1000)

print 'Cutoff: {}\nKS stat for x: {}\nKS stat for y: {}\nKS stat for z: {}'.format(cutoff, ks_x, ks_y, ks_z)

Cutoff: 0.0430069761783
KS stat for x: 0.276059991328
KS stat for y: 0.14287874528
KS stat for z: 0.053970004336


<u><i>Conclusion:</i></u>  
All three K-S statistics exceed the cutoff of 0.043, so none of x, y or z conforms to Benford's Law.

<u><i>Explanation:</i></u>  
Benford's Law applies to data that come from a multiplicative random variable process. Each of x, y and z is drawn directly from a single uniform or normal distribution rather than from such a process, so none of them conforms to Benford's Law.
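
The three near-identical computations above could be collected into one helper. The following is a minimal sketch, not the method used in the cells above: the function name `benford_ks` is hypothetical, and it extracts the leading digit arithmetically instead of through string conversion. It also returns the cutoff for the actual number of non-zero values it tested.

```python
import numpy as np

def benford_ks(values):
    """Return (KS statistic, 5% cutoff, n) for the first digits of the non-zero values."""
    values = np.abs(np.asarray(values, dtype=float))
    values = values[values > 0]                      # zero has no leading digit
    # Shift each value into [1, 10) and truncate to get its leading digit
    exponents = np.floor(np.log10(values))
    digits = np.floor(values / 10.0 ** exponents).astype(int)
    digits = np.clip(digits, 1, 9)                   # guard against floating-point edge cases
    n = len(digits)
    observed = np.bincount(digits, minlength=10)[1:10] / float(n)
    expected = np.log10(1 + 1.0 / np.arange(1, 10))
    ks = np.abs(observed.cumsum() - expected.cumsum()).max()
    return ks, 1.36 / np.sqrt(n), n

# Usage on a fresh uniform sample (fixed seed, so not the same draws as x above)
rng = np.random.RandomState(0)
ks, cutoff_n, n = benford_ks(rng.uniform(0, 1, 1000))
print('KS: {:.3f}, cutoff: {:.3f}, n: {}'.format(ks, cutoff_n, n))
```

For a uniform sample the statistic should land well above the cutoff, in line with the result for x.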


### B) Testing Benford's Law on x\*y\*z

In [6]:
xyz = x*y*z

xyz_1stdigit = pd.Series([[digit for digit in list(num) if digit not in ('0','.','-')][0]  for num in xyz.astype(str)]).astype(int)
xyz_digcounts = pd.DataFrame( xyz_1stdigit.value_counts().sort_index()/1000,columns = ['xyz'])

ks_xyz = xyz_digcounts.xyz.cumsum().subtract(digit_dist_cumsum.expected).abs().max()

print 'Cutoff: {}\nKS stat for x*y*z: {}'.format(cutoff,ks_xyz)

Cutoff: 0.0430069761783
KS stat for x*y*z: 0.017029995664


<u><i>Conclusion:</i></u>  
Here the K-S statistic (0.017) is below the cutoff (0.043), so x\*y\*z is consistent with Benford's Law.

<u><i>Explanation:</i></u>  
Benford's Law applies to data that come from a multiplicative random variable process. By construction, x\*y\*z is the product of three random variables, so it is expected to conform to Benford's Law.
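
The link between multiplication and Benford's Law can be illustrated numerically. The sketch below is an illustration under stated assumptions, not part of the assignment: it multiplies `k` independent uniform(0, 1) variables (a choice made here for simplicity) and reports the K-S statistic against Benford's distribution for several values of `k`. The helper name `ks_vs_benford` is hypothetical.

```python
import numpy as np

rng = np.random.RandomState(0)          # fixed seed so the sketch is repeatable
expected_cdf = np.log10(1 + 1.0 / np.arange(1, 10)).cumsum()

def ks_vs_benford(values):
    # Leading digit via the mantissa: 10 ** frac(log10|v|) lies in [1, 10)
    logs = np.log10(np.abs(values))
    digits = np.clip(np.floor(10.0 ** (logs - np.floor(logs))).astype(int), 1, 9)
    observed = np.bincount(digits, minlength=10)[1:10] / float(len(digits))
    return np.abs(observed.cumsum() - expected_cdf).max()

ks_by_factors = {}
for k in (1, 2, 5, 10):
    # Product of k independent uniform(0, 1) draws, 10000 samples each
    product = rng.uniform(0, 1, size=(k, 10000)).prod(axis=0)
    ks_by_factors[k] = ks_vs_benford(product)
    print('{} factor(s): KS = {:.3f}'.format(k, ks_by_factors[k]))
```

The statistic should shrink as more factors are multiplied, because the fractional part of the summed logarithms spreads out toward a uniform distribution, which is exactly the condition under which first digits follow Benford's Law.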


### C) Testing Benford's Law on x\*y\*z after rounding

In [7]:
xyz_round = xyz.round(1)

#Drop values that round to zero ('0.0' or '-0.0'), since they have no leading digit
xyz_r_1stdigit = pd.Series([[digit for digit in list(num) if digit not in ('0','.','-')][0] \
                            for num in xyz_round.astype(str) if num not in ('0.0', '-0.0')]).astype(int)
xyz_r_digcounts = pd.DataFrame( xyz_r_1stdigit.value_counts().sort_index()/len(xyz_r_1stdigit),columns = ['xyz_r'])


ks_xyz_r = xyz_r_digcounts.xyz_r.cumsum().subtract(digit_dist_cumsum.expected).abs().max()
print 'Cutoff: {}\nKS stat for rounded x*y*z: {}'.format(cutoff,ks_xyz_r)

Cutoff: 0.0430069761783
KS stat for rounded x*y*z: 0.0968583371171


<u><i>Conclusion:</i></u>  
Here the K-S statistic (0.097) exceeds the cutoff (0.043), so xyz_round does not conform to Benford's Law.

<u><i>Explanation:</i></u>  
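Although the underlying data (x\*y\*z) are multiplicative in nature, rounding to one decimal place manipulates them: values smaller than 0.05 in magnitude become zero and are dropped, and the leading digit of other small values can change (for example, 0.07 becomes 0.1). The rounded data are therefore no longer fully representative of a multiplicative process and do not conform to Benford's Law.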
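
The effect of rounding on leading digits can be seen on a handful of values. This is a small sketch with illustrative numbers that are not taken from the homework data; the helper `leading_digit` is hypothetical and reads the digit from scientific notation.

```python
import numpy as np

def leading_digit(value):
    """Leading significant digit of a non-zero number, or None for zero."""
    if value == 0:
        return None
    # Scientific notation puts the leading digit first: '2.600...e-01' -> 2
    return int('{:.15e}'.format(abs(value))[0])

samples = np.array([0.004, 0.07, 0.26, 3.14])   # illustrative values only
rounded = samples.round(1)                       # same rounding as xyz.round(1)

before = [leading_digit(v) for v in samples]
after = [leading_digit(v) for v in rounded]
print('before rounding: {}'.format(before))
print('after rounding:  {}'.format(after))
```

Here 0.004 rounds to zero and loses its digit, 0.07 moves from digit 7 to 1, and 0.26 moves from 2 to 3, while 3.14 keeps its leading digit. Since the zeros are dropped, fewer than 1000 values are tested, so a cutoff of 1.36/sqrt(n) computed from the surviving count would be somewhat larger than the one printed above.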

<br>
<br>

## Q2 - Bag of Words

### A) Word Count


In [8]:
#Loading the file and getting a list of words
tesla_words = []

with open('HW7_Tesla_2015.txt') as f:
    for line in f:
        words = re.findall(r'\w+', line.upper())
        tesla_words.extend(word for word in words if not word.isdigit())
        
tesla_words = pd.DataFrame(tesla_words,columns = ['tesla'])
print 'Count of words is {}'.format(len(tesla_words))

Count of words is 1939
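
To make the counting rule concrete, the same tokenization can be applied to a single sentence. The sentence below is made up for illustration and does not come from the Tesla file.

```python
import re

# Made-up sentence, used only to show how the \w+ pattern splits text
sample = "Revenue grew 27% in 2015, but the company wasn't profitable."

tokens = re.findall(r'\w+', sample.upper())
sample_words = [token for token in tokens if not token.isdigit()]
print(sample_words)
print('Count of words is {}'.format(len(sample_words)))
```

Purely numeric tokens (27, 2015) are excluded, and the apostrophe splits "wasn't" into the two tokens WASN and T, so contractions are counted as two words under this rule.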


### B) Calculating sentiment

In [9]:
#Loading the positive and negative words
positive_words = pd.read_csv('HW7_LM_pos_words.txt', squeeze=True, header=None)
negative_words = pd.read_csv('HW7_LM_neg_words.txt', squeeze=True, header=None)

positive_count = sum(tesla_words.tesla.isin(positive_words))
negative_count = sum(tesla_words.tesla.isin(negative_words))

print 'Sentiment Values:\ni) (positive - negative) / total words: {}\nii) negative / total words: {}'.\
            format((positive_count-negative_count)/float(len(tesla_words)), negative_count/float(len(tesla_words)))

Sentiment Values:
i) (positive - negative) / total words: 0.00257864878804
ii) negative / total words: 0.00825167612171
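
The two sentiment measures can also be written as a small standalone function. This is a sketch with hypothetical names and tiny made-up word lists, not the `HW7_LM_*.txt` lists loaded above.

```python
def sentiment_scores(words, positive, negative):
    """Return ((positive - negative) / total, negative / total) for a list of words."""
    positive, negative = set(positive), set(negative)
    pos_count = sum(1 for word in words if word in positive)
    neg_count = sum(1 for word in words if word in negative)
    total = float(len(words))
    return (pos_count - neg_count) / total, neg_count / total

# Hypothetical mini example: 1 positive and 2 negative words out of 8
demo_words = ['STRONG', 'GROWTH', 'DESPITE', 'LOSS', 'AND', 'DELAY', 'IN', 'PRODUCTION']
net_tone, negative_share = sentiment_scores(demo_words, positive=['STRONG'], negative=['LOSS', 'DELAY'])
print('net tone: {}, negative share: {}'.format(net_tone, negative_share))
```

In this example the net tone is (1 - 2) / 8 = -0.125 and the negative share is 2 / 8 = 0.25.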


### C) Are there negator words?

In [10]:
words = tesla_words.tesla.isin(['NOT','NO','NEVER'])

print 'Number of negator words in the file: {}'.format(sum(words))

Number of negator words in the file: 7
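
Negators matter for a bag-of-words score because a phrase such as "not strong" would still count STRONG as positive. One possible way to check for this, sketched here under the assumption that a negator affects only the word immediately after it, is to list lexicon words that directly follow a negator. The function name `negated_hits` and the `window` parameter are hypothetical.

```python
NEGATORS = {'NOT', 'NO', 'NEVER'}

def negated_hits(words, lexicon, window=1):
    """Return (index, word) pairs for lexicon words with a negator in the preceding `window` tokens."""
    lexicon = set(lexicon)
    hits = []
    for i, word in enumerate(words):
        preceding = words[max(0, i - window):i]
        if word in lexicon and NEGATORS.intersection(preceding):
            hits.append((i, word))
    return hits

# Made-up token list: only the first STRONG is preceded by a negator
demo_tokens = ['RESULTS', 'WERE', 'NOT', 'STRONG', 'BUT', 'DEMAND', 'WAS', 'STRONG']
print(negated_hits(demo_tokens, ['STRONG']))
```

Applied to the positive word list, a check like this would show how many of the positive matches might actually carry a negative meaning.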
