In [1]:
# from https://en.wikipedia.org/wiki/Inflation

In [2]:
document_text = """

In economics, inflation (or less frequently, price inflation) is a general rise in the price level of an economy
over a period of time.[1][2][3][4] When the general price level rises, each unit of currency buys fewer goods and
services; consequently, inflation reflects a reduction in the purchasing power per unit of money – a loss of real
value in the medium of exchange and unit of account within the economy.[5][6] The opposite of inflation is 
deflation, a sustained decrease in the general price level of goods and services. The common measure of inflation
is the inflation rate, the annualised percentage change in a general price index, usually the consumer price 
index, over time.[7]

Economists believe that very high rates of inflation and hyperinflation are harmful, and are caused by excessive
growth of the money supply.[8] Views on which factors determine low to moderate rates of inflation are more 
varied. Low or moderate inflation may be attributed to fluctuations in real demand for goods and services, or 
changes in available supplies such as during scarcities.[9] However, the consensus view is that a long sustained 
period of inflation is caused by money supply growing faster than the rate of economic growth.[10][11]

Inflation affects economies in various positive and negative ways. The negative effects of inflation include an 
increase in the opportunity cost of holding money, uncertainty over future inflation which may discourage 
investment and savings, and if inflation were rapid enough, shortages of goods as consumers begin hoarding out 
of concern that prices will increase in the future. Positive effects include reducing unemployment due to nominal 
wage rigidity,[12] allowing the central bank greater freedom in carrying out monetary policy, encouraging loans 
and investment instead of money hoarding, and avoiding the inefficiencies associated with deflation.

Today, most economists favour a low and steady rate of inflation.[13] Low (as opposed to zero or negative) 
inflation reduces the severity of economic recessions by enabling the labor market to adjust more quickly in a 
downturn, and reduces the risk that a liquidity trap prevents monetary policy from stabilising the economy.[14] 
The task of keeping the rate of inflation low and stable is usually given to monetary authorities. Generally, 
these monetary authorities are the central banks that control monetary policy through the setting of interest 
rates, through open market operations, and through the setting of banking reserve requirements.[15]
"""

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

In [4]:
vectorizer = CountVectorizer()

In [5]:
X_train = vectorizer.fit_transform([document_text])

In [6]:
X_train

<1x194 sparse matrix of type '<class 'numpy.longlong'>'
	with 194 stored elements in Compressed Sparse Row format>
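
The matrix has one row (the single document) and 194 columns, one per distinct token. As a rough sketch of what is being counted, the snippet below approximates the vectorizer with the standard library only. It assumes CountVectorizer's documented defaults (lowercasing, and a token pattern that keeps runs of two or more word characters); `bag_of_words` and `sample` are illustrative names, not part of scikit-learn.

```python
import re
from collections import Counter

# Assumption: this regex mirrors CountVectorizer's documented default
# token_pattern, which keeps runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def bag_of_words(text):
    """Count lowercased tokens of two or more word characters."""
    return Counter(TOKEN_PATTERN.findall(text.lower()))


# Short sample adapted from the opening sentence of document_text.
sample = "In economics, inflation is a general rise in the price level.[1][10]"
counts = bag_of_words(sample)

# Sorted vocabulary, analogous to the feature names of a fitted vectorizer.
vocabulary = sorted(counts)
print(vocabulary)
print(counts["in"], counts["a"], counts["10"])
```

Single-character tokens such as 'a' and '1' are dropped while '10' survives, which is consistent with the citation numbers '10' to '15' showing up among the feature names for this document.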

In [7]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
vectorizer.get_feature_names_out().tolist()

['10',
 '11',
 '12',
 '13',
 '14',
 '15',
 'account',
 'adjust',
 'affects',
 'allowing',
 'an',
 'and',
 'annualised',
 'are',
 'as',
 'associated',
 'attributed',
 'authorities',
 'available',
 'avoiding',
 'bank',
 'banking',
 'banks',
 'be',
 'begin',
 'believe',
 'buys',
 'by',
 'carrying',
 'caused',
 'central',
 'change',
 'changes',
 'common',
 'concern',
 'consensus',
 'consequently',
 'consumer',
 'consumers',
 'control',
 'cost',
 'currency',
 'decrease',
 'deflation',
 'demand',
 'determine',
 'discourage',
 'downturn',
 'due',
 'during',
 'each',
 'economic',
 'economics',
 'economies',
 'economists',
 'economy',
 'effects',
 'enabling',
 'encouraging',
 'enough',
 'excessive',
 'exchange',
 'factors',
 'faster',
 'favour',
 'fewer',
 'fluctuations',
 'for',
 'freedom',
 'frequently',
 'from',
 'future',
 'general',
 'generally',
 'given',
 'goods',
 'greater',
 'growing',
 'growth',
 'harmful',
 'high',
 'hoarding',
 'holding',
 'however',
 'hyperinflation',
 'if',
 'in',

In [8]:
# compare how these related words are normalized:
## 'transmission',
## 'transmissions',
## 'transmit'

In [9]:
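# stemmed words: the Porter stemmer strips suffixes by rule
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

# lemmatized words: the WordNet lemmatizer maps words to dictionary forms
# (needs the 'wordnet' corpus, e.g. nltk.download('wordnet'))
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()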

In [10]:
print(stemmer.stem("transmission"))
print(stemmer.stem("transmissions"))
print(stemmer.stem("transmit"))

transmiss
transmiss
transmit
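
The stemmer output above comes from rule-based suffix stripping. The function below is not the Porter algorithm; it is a two-rule toy (a hypothetical `toy_stem` helper) that happens to reproduce these three results and shows why 'transmit' does not collapse with the other two words.

```python
def toy_stem(word):
    """Strip a plural 's', then an 'ion' that follows 's' or 't'.

    Assumption: a two-rule illustration only, not the Porter algorithm.
    """
    word = word.lower()
    # Rule 1: drop a trailing plural 's' (but leave 'ss' endings alone).
    if word.endswith("s") and not word.endswith("ss"):
        word = word[:-1]
    # Rule 2: drop 'ion' when the preceding letter is 's' or 't'.
    if word.endswith("ion") and len(word) > 4 and word[-4] in "st":
        word = word[:-3]
    return word


for w in ["transmission", "transmissions", "transmit"]:
    print(w, "->", toy_stem(w))
```

Neither rule applies to 'transmit', so it is returned unchanged. A stem such as 'transmiss' is not a dictionary word, which is the main practical difference from lemmatization.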


In [11]:
print(lemmatizer.lemmatize("transmission"))
print(lemmatizer.lemmatize("transmissions"))
print(lemmatizer.lemmatize("transmit"))

transmission
transmission
transmit


In [12]:
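# lemmatize() treats each token as a noun by default (pos='n');
# split() leaves punctuation attached, so tokens like 'services;' pass through unchanged
lemma_text = ' '.join(lemmatizer.lemmatize(w) for w in document_text.split())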

In [13]:
lemma_text

'In economics, inflation (or le frequently, price inflation) is a general rise in the price level of an economy over a period of time.[1][2][3][4] When the general price level rises, each unit of currency buy fewer good and services; consequently, inflation reflects a reduction in the purchasing power per unit of money – a loss of real value in the medium of exchange and unit of account within the economy.[5][6] The opposite of inflation is deflation, a sustained decrease in the general price level of good and services. The common measure of inflation is the inflation rate, the annualised percentage change in a general price index, usually the consumer price index, over time.[7] Economists believe that very high rate of inflation and hyperinflation are harmful, and are caused by excessive growth of the money supply.[8] Views on which factor determine low to moderate rate of inflation are more varied. Low or moderate inflation may be attributed to fluctuation in real demand for good and

In [14]:
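# stem each whitespace-separated token; punctuation stays attached,
# so tokens like 'economics,' are left unstemmed
stem_text = ' '.join(stemmer.stem(w) for w in document_text.split())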

In [15]:
stem_text

'In economics, inflat (or less frequently, price inflation) is a gener rise in the price level of an economi over a period of time.[1][2][3][4] when the gener price level rises, each unit of currenc buy fewer good and services; consequently, inflat reflect a reduct in the purchas power per unit of money – a loss of real valu in the medium of exchang and unit of account within the economy.[5][6] the opposit of inflat is deflation, a sustain decreas in the gener price level of good and services. the common measur of inflat is the inflat rate, the annualis percentag chang in a gener price index, usual the consum price index, over time.[7] economist believ that veri high rate of inflat and hyperinfl are harmful, and are caus by excess growth of the money supply.[8] view on which factor determin low to moder rate of inflat are more varied. low or moder inflat may be attribut to fluctuat in real demand for good and services, or chang in avail suppli such as dure scarcities.[9] however, the c

In [16]:
stem_vectorizer = CountVectorizer()
stem_vectorizer.fit_transform([stem_text])
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
stem_vectorizer.get_feature_names_out().tolist()

['10',
 '11',
 '12',
 '13',
 '14',
 '15',
 'account',
 'adjust',
 'affect',
 'allow',
 'an',
 'and',
 'annualis',
 'are',
 'as',
 'associ',
 'attribut',
 'author',
 'authorities',
 'avail',
 'avoid',
 'bank',
 'be',
 'begin',
 'believ',
 'buy',
 'by',
 'carri',
 'caus',
 'central',
 'chang',
 'common',
 'concern',
 'consensu',
 'consequently',
 'consum',
 'control',
 'cost',
 'currenc',
 'decreas',
 'deflation',
 'demand',
 'determin',
 'discourag',
 'downturn',
 'due',
 'dure',
 'each',
 'econom',
 'economi',
 'economics',
 'economist',
 'economy',
 'effect',
 'enabl',
 'encourag',
 'enough',
 'excess',
 'exchang',
 'factor',
 'faster',
 'favour',
 'fewer',
 'fluctuat',
 'for',
 'freedom',
 'frequently',
 'from',
 'futur',
 'future',
 'gener',
 'generally',
 'given',
 'good',
 'greater',
 'grow',
 'growth',
 'harmful',
 'high',
 'hoard',
 'hoarding',
 'hold',
 'however',
 'hyperinfl',
 'if',
 'in',
 'includ',
 'increas',
 'index',
 'ineffici',
 'inflat',
 'inflation',
 'instead',
 'in

In [17]:
# exercise: generate the same feature list for lemma_text as well

In [19]:
# keep only alphabetic characters

In [35]:
import re
regex = re.compile('[^a-zA-Z]')

In [41]:
alpha_text = regex.sub(' ', document_text)
alpha_text = ' '.join(alpha_text.split())

In [42]:
alpha_text

'In economics inflation or less frequently price inflation is a general rise in the price level of an economy over a period of time When the general price level rises each unit of currency buys fewer goods and services consequently inflation reflects a reduction in the purchasing power per unit of money a loss of real value in the medium of exchange and unit of account within the economy The opposite of inflation is deflation a sustained decrease in the general price level of goods and services The common measure of inflation is the inflation rate the annualised percentage change in a general price index usually the consumer price index over time Economists believe that very high rates of inflation and hyperinflation are harmful and are caused by excessive growth of the money supply Views on which factors determine low to moderate rates of inflation are more varied Low or moderate inflation may be attributed to fluctuations in real demand for goods and services or changes in available 
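
To see what the substitution removes, here is the same pattern applied to a short made-up snippet. The accented words are an assumption added for illustration (they do not occur in document_text) and show one side effect worth keeping in mind: `[^a-zA-Z]` also discards non-ASCII letters.

```python
import re

regex = re.compile('[^a-zA-Z]')

# Made-up snippet: citation markers, an en dash (\u2013), and two accented
# words ("na\u00efve" and "caf\u00e9") that are not in the original text.
snippet = "time.[1][2] Money \u2013 a loss of real value; na\u00efve caf\u00e9"

# Replace every non-letter with a space, then collapse repeated whitespace.
alpha_snippet = ' '.join(regex.sub(' ', snippet).split())
print(alpha_snippet)
```

The citation markers, dash, and semicolon disappear as intended, but the accented words are split or truncated. For text with many non-ASCII letters, a Unicode-aware pattern such as `[^\W\d_]+` used with `re.findall` may be a better fit.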

In [44]:
# remove stop words

In [45]:
# remove stop words
from nltk.corpus import stopwords
stop = stopwords.words('english')

In [54]:
'or' in stop

True

In [59]:
# lowercase before filtering: NLTK's English stop word list is all lowercase
nostop_text = ' '.join(word for word in alpha_text.lower().split() if word not in stop)

In [60]:
print(nostop_text)

economics inflation less frequently price inflation general rise price level economy period time general price level rises unit currency buys fewer goods services consequently inflation reflects reduction purchasing power per unit money loss real value medium exchange unit account within economy opposite inflation deflation sustained decrease general price level goods services common measure inflation inflation rate annualised percentage change general price index usually consumer price index time economists believe high rates inflation hyperinflation harmful caused excessive growth money supply views factors determine low moderate rates inflation varied low moderate inflation may attributed fluctuations real demand goods services changes available supplies scarcities however consensus view long sustained period inflation caused money supply growing faster rate economic growth inflation affects economies various positive negative ways negative effects inflation include increase opportu

In [61]:
# What happens if you use the original document_text instead? Does it still catch all the stop words?
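
One way to explore that question without NLTK is the small sketch below. It assumes a hypothetical four-word stop list (`stop_sample`) standing in for the NLTK list, and a made-up sentence in the style of the source text.

```python
# Hypothetical four-word stop list standing in for NLTK's English list.
stop_sample = {"the", "of", "is", "that"}

raw = "The opposite of inflation is deflation; that is, the reverse."

# Naive filtering on the raw text: case-sensitive, punctuation still attached.
naive = [w for w in raw.split() if w not in stop_sample]

# Lowercasing and stripping punctuation first lets every stop word match.
cleaned = [w.strip(".,;:()[]").lower() for w in raw.split()]
filtered = [w for w in cleaned if w and w not in stop_sample]

print(naive)
print(filtered)
```

In the naive version, 'The' and 'is,' slip through because they do not match the lowercase, punctuation-free entries exactly. That is why the cells above lowercase the text and strip non-letters before filtering.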

In [62]:
# generate a stemmed, alpha, no stop word list

In [63]:
stem_text = ' '.join(stemmer.stem(w) for w in nostop_text.split())

In [64]:
stem_text

'econom inflat less frequent price inflat gener rise price level economi period time gener price level rise unit currenc buy fewer good servic consequ inflat reflect reduct purchas power per unit money loss real valu medium exchang unit account within economi opposit inflat deflat sustain decreas gener price level good servic common measur inflat inflat rate annualis percentag chang gener price index usual consum price index time economist believ high rate inflat hyperinfl harm caus excess growth money suppli view factor determin low moder rate inflat vari low moder inflat may attribut fluctuat real demand good servic chang avail suppli scarciti howev consensu view long sustain period inflat caus money suppli grow faster rate econom growth inflat affect economi variou posit neg way neg effect inflat includ increas opportun cost hold money uncertainti futur inflat may discourag invest save inflat rapid enough shortag good consum begin hoard concern price increas futur posit effect inclu

In [65]:
lemma_text = ' '.join(lemmatizer.lemmatize(w) for w in nostop_text.split())

In [66]:
lemma_text

'economics inflation le frequently price inflation general rise price level economy period time general price level rise unit currency buy fewer good service consequently inflation reflects reduction purchasing power per unit money loss real value medium exchange unit account within economy opposite inflation deflation sustained decrease general price level good service common measure inflation inflation rate annualised percentage change general price index usually consumer price index time economist believe high rate inflation hyperinflation harmful caused excessive growth money supply view factor determine low moderate rate inflation varied low moderate inflation may attributed fluctuation real demand good service change available supply scarcity however consensus view long sustained period inflation caused money supply growing faster rate economic growth inflation affect economy various positive negative way negative effect inflation include increase opportunity cost holding money u

In [69]:
stem_vectorizer = CountVectorizer()
stem_vectorizer.fit_transform([stem_text])
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
stem_vectorizer.get_feature_names_out().tolist()

['account',
 'adjust',
 'affect',
 'allow',
 'annualis',
 'associ',
 'attribut',
 'author',
 'avail',
 'avoid',
 'bank',
 'begin',
 'believ',
 'buy',
 'carri',
 'caus',
 'central',
 'chang',
 'common',
 'concern',
 'consensu',
 'consequ',
 'consum',
 'control',
 'cost',
 'currenc',
 'decreas',
 'deflat',
 'demand',
 'determin',
 'discourag',
 'downturn',
 'due',
 'econom',
 'economi',
 'economist',
 'effect',
 'enabl',
 'encourag',
 'enough',
 'excess',
 'exchang',
 'factor',
 'faster',
 'favour',
 'fewer',
 'fluctuat',
 'freedom',
 'frequent',
 'futur',
 'gener',
 'given',
 'good',
 'greater',
 'grow',
 'growth',
 'harm',
 'high',
 'hoard',
 'hold',
 'howev',
 'hyperinfl',
 'includ',
 'increas',
 'index',
 'ineffici',
 'inflat',
 'instead',
 'interest',
 'invest',
 'keep',
 'labor',
 'less',
 'level',
 'liquid',
 'loan',
 'long',
 'loss',
 'low',
 'market',
 'may',
 'measur',
 'medium',
 'moder',
 'monetari',
 'money',
 'neg',
 'nomin',
 'open',
 'oper',
 'opportun',
 'oppos',
 'oppos