Source of articles: Times of India Archives
Start Date of articles: 1st January, 2016
End Date of articles: 31st January, 2016
- Get the articles and analyse the frequency of words used.
- Later we extended it to bigrams.
- Downloaded articles from Times of India archives
- Tokenized the articles
- Stored all the words and pairs of consecutive words (bigrams) for each category
- Calculated Pointwise Mutual Information (PMI) for all pairs of consecutive words
- NodeJS: for downloading articles and calculating PMI
- Python: for tokenizing and storing the words and the pairs of consecutive words
- MongoDB: database used to store data
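
The tokenizing and counting steps can be sketched in a few lines of Python. This is only an illustration: the project's actual tokenizer is not shown here, so the `tokenize` and `count_words_and_bigrams` helpers, the regular expression, and the sample sentences are all assumptions. The sketch also skips any further normalization (such as stop-word removal) and uses in-memory counters instead of MongoDB.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters (an assumed, simplified tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def count_words_and_bigrams(articles):
    """Return (word_counts, bigram_counts) for a list of article strings."""
    word_counts = Counter()
    bigram_counts = Counter()
    for article in articles:
        tokens = tokenize(article)
        word_counts.update(tokens)
        # Pair each token with its successor; pairs never span two articles.
        bigram_counts.update(zip(tokens, tokens[1:]))
    return word_counts, bigram_counts


# Made-up sample text, not taken from the archive.
articles = [
    "The state government said the chief minister will visit.",
    "The chief minister met state government officials.",
]
words, bigrams = count_words_and_bigrams(articles)
print(words["minister"])               # 2
print(bigrams[("chief", "minister")])  # 2
```

Keeping one pair of counters per category would mirror the per-category storage described above.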
Probability of finding the word W:
P(W) = count(W) / (sum of the frequencies of all words)
Probability of finding the bigram (Wi,Wi-1):
P(Wi,Wi-1) = count(Wi,Wi-1) / (sum of the frequencies of all pairs of consecutive words)
PMI:
PMI(Wi,Wi-1) = log( P(Wi,Wi-1) / ( P(Wi) P(Wi-1) ) )
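
As a rough illustration of the formulas above, the following Python sketch computes PMI from two counters. The function name `pmi` and the toy counts are invented for this example. The base of the logarithm is not stated in this document, so the natural logarithm is assumed here.

```python
import math
from collections import Counter


def pmi(bigram, word_counts, bigram_counts):
    """PMI of a bigram, following the P(W) and P(Wi,Wi-1) definitions above.

    Assumes the bigram and both of its words occur at least once,
    otherwise the logarithm is undefined.
    """
    first, second = bigram
    total_words = sum(word_counts.values())
    total_bigrams = sum(bigram_counts.values())
    p_bigram = bigram_counts[bigram] / total_bigrams
    p_first = word_counts[first] / total_words
    p_second = word_counts[second] / total_words
    return math.log(p_bigram / (p_first * p_second))  # natural log (assumption)


# Toy counts, invented for illustration only.
word_counts = Counter({"saudi": 2, "arabia": 2, "said": 6, "the": 10})
bigram_counts = Counter({
    ("saudi", "arabia"): 2,
    ("the", "said"): 1,
    ("said", "the"): 7,
})

# P(bigram) = 2/10, P(saudi) = P(arabia) = 2/20, so the ratio is 20.
print(round(pmi(("saudi", "arabia"), word_counts, bigram_counts), 4))  # 2.9957
```

Words that almost always appear together, like the pair in this toy example, score much higher than pairs of individually common words.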
- Redirection of links for the articles resulted in empty responses.
  - Used the `follow-redirects` module to fix the redirection problem.
- Speed of downloading articles and parsing the words and bigrams for calculations.
  - For downloading articles, we ran 4 servers at a time with 2 threads on each server.
  - For parsing words and bigrams, we ran 4 servers at a time with one category on each server.
- Getting meaningful bigrams with PMI.
  - We set a threshold frequency for bigrams for each category.
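
The frequency threshold matters because a pair that occurs only once or twice can still get a very high PMI. A minimal Python sketch of such a filter is given below. The function name `filter_bigrams`, its parameters, and the toy counts are hypothetical, and the natural logarithm is again an assumption.

```python
import math
from collections import Counter


def filter_bigrams(word_counts, bigram_counts, min_pmi, min_freq):
    """Return (bigram, PMI) pairs passing both thresholds, highest PMI first."""
    total_words = sum(word_counts.values())
    total_bigrams = sum(bigram_counts.values())
    kept = []
    for (first, second), freq in bigram_counts.items():
        if freq < min_freq:
            continue  # drop rare pairs before scoring them
        p_bigram = freq / total_bigrams
        p_first = word_counts[first] / total_words
        p_second = word_counts[second] / total_words
        score = math.log(p_bigram / (p_first * p_second))  # natural log (assumption)
        if score >= min_pmi:
            kept.append((first + " " + second, score))
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Toy counts, invented for illustration only.
word_counts = Counter({"typo": 1, "glitch": 1, "chief": 50, "minister": 50, "said": 898})
bigram_counts = Counter({
    ("typo", "glitch"): 1,
    ("chief", "minister"): 40,
    ("said", "said"): 459,
})

for name, score in filter_bigrams(word_counts, bigram_counts, min_pmi=2, min_freq=10):
    print(name, round(score, 4))  # chief minister 3.4657
```

With `min_freq=1` the one-off pair "typo glitch" would rank first, which is the kind of noise the threshold removes.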
- City
  - Most frequently used words and bigrams were related to
    - Politics
    - Crime
    - Money
- India
  - Most frequently used words and bigrams were related to
    - Politics
    - Crime
    - Terrorist attacks
- Life
  - Most frequently used words and bigrams were related to
    - Health
    - Diseases
    - Diet
- World
  - Most frequently occurring country/city names
    - China
    - United States
    - North Korea
    - Saudi Arabia
    - New York
  - Most frequently occurring words and bigrams were related to
    - Politics
    - Terrorism
- Business
  - Most frequently used words and bigrams were related to
    - Money
    - Stock Market
    - Banking
    - Petroleum
    - Development
| Category | Number of articles |
|---|---|
| City | 11,137 |
| India | 1,157 |
| Life | 932 |
| World | 563 |
| Business | 464 |

Filters for the table below:
- Min PMI: 9
- Min frequency: 50

| S. No. | Bigram | PMI |
|---|---|---|
| 1 | modus operandi | 11.7213 |
| 2 | prima facie | 11.4713 |
| 3 | saudi arabia | 11.3766 |
| 4 | wi fi | 11.3463 |
| 5 | smriti irani | 11.3285 |
| 6 | aam aadmi | 11.1529 |
| 7 | bullock cart | 11.0430 |
| 8 | bone marrow | 10.9561 |
| 9 | jawaharlal nehru | 10.8396 |
| 10 | bharatiya janata | 10.7816 |
| 11 | swine flu | 10.6364 |
| 12 | sri lanka | 10.5937 |
| 13 | oommen chandy | 10.4838 |
| 14 | makar sankranti | 10.4699 |
| 15 | chandrababu naidu | 10.4163 |
| 16 | pimpri chinchwad | 10.4076 |
| 17 | jd u | 10.3519 |
| 18 | mamata banerjee | 10.3157 |
| 19 | mehbooba mufti | 10.2770 |
| 20 | naveen patnaik | 10.2601 |
| 21 | devendra fadnavis | 10.1470 |
| 22 | penal code | 10.1022 |
| 23 | swachh bharat | 10.0862 |
| 24 | freedom fighter | 10.0715 |
| 25 | slum dweller | 10.0346 |
| 26 | rajya sabha | 10.0135 |
| 27 | shiv sena | 9.9820 |
| 28 | lok sabha | 9.8566 |
| 29 | rohith vemula | 9.8273 |
| 30 | sq ft | 9.6569 |
| 31 | western disturbance | 9.6211 |
| 32 | story offline | 9.6129 |
| 33 | arvind kejriwal | 9.5935 |
| 34 | chinese manjha | 9.5800 |
| 35 | dense fog | 9.5551 |
| 36 | birth anniversary | 9.5337 |
| 37 | renewable energy | 9.4975 |
| 38 | tribunal ngt | 9.4840 |
| 39 | mahatma gandhi | 9.4766 |
| 40 | tamil nadu | 9.4739 |
| 41 | manohar lal | 9.4205 |
| 42 | cctv footage | 9.4136 |
| 43 | appa rao | 9.3603 |
| 44 | vice chancellor | 9.3048 |
| 45 | j jayalalithaa | 9.2945 |
| 46 | cold wave | 9.2788 |
| 47 | writ petition | 9.2346 |
| 48 | sim card | 9.2013 |
| 49 | real estate | 9.1993 |
| 50 | boundary wall | 9.1423 |
| 51 | square yard | 9.1422 |
| 52 | stray dog | 9.1354 |
| 53 | narendra modis | 9.1086 |
| 54 | ration card | 9.0813 |
| 55 | cctv camera | 9.0442 |
| 56 | indira gandhi | 9.0407 |
| 57 | animal husbandry | 9.0384 |

| S. No. | Bigram | Frequency |
|---|---|---|
| 1 | r crore | 2593 |
| 2 | state government | 2026 |
| 3 | chief minister | 1982 |
| 4 | year old | 1835 |
| 5 | police station | 1679 |
| 6 | r lakh | 1350 |
| 7 | official said | 1334 |
| 8 | high court | 1314 |
| 9 | police said | 1064 |
| 10 | told toi | 1014 |
| 11 | source said | 1012 |
| 12 | new delhi | 948 |
| 13 | municipal corporation | 844 |
| 14 | civic body | 765 |
| 15 | prime minister | 571 |
| 16 | new year | 545 |
| 17 | police officer | 508 |
| 18 | tamil nadu | 499 |
| 19 | year ago | 494 |
| 20 | degree celsius | 471 |

| S. No. | Word | Frequency |
|---|---|---|
| 1 | said | 33144 |
| 2 | police | 13682 |
| 3 | year | 10921 |
| 4 | state | 10302 |
| 5 | government | 9386 |
| 6 | city | 8307 |
| 7 | r | 7566 |
| 8 | day | 6818 |
| 9 | official | 6404 |
| 10 | case | 5638 |
| 11 | people | 5427 |
| 12 | student | 5423 |
| 13 | minister | 5291 |
| 14 | time | 5148 |
| 15 | road | 5119 |
| 16 | district | 5007 |
| 17 | area | 4964 |
| 18 | court | 4775 |
| 19 | department | 4558 |
| 20 | new | 4474 |

Filters for the table below:
- Min PMI: 9
- Min frequency: 10

| S. No. | Bigram | PMI |
|---|---|---|
| 1 | wi fi | 10.5464 |
| 2 | saudi arabia | 10.2781 |
| 3 | barack obama | 10.2781 |
| 4 | lone wolf | 10.1156 |
| 5 | aam aadmi | 10.1032 |
| 6 | dipak misra | 10.0716 |
| 7 | ghulam nabi | 10.0668 |
| 8 | ardh kumbh | 9.9690 |
| 9 | bullock cart | 9.8155 |
| 10 | nobel laureate | 9.8127 |
| 11 | nicobar island | 9.7972 |
| 12 | mukul rohatgi | 9.7857 |
| 13 | shafi armar | 9.7817 |
| 14 | sitaram yechury | 9.7441 |
| 15 | terminally ill | 9.7101 |
| 16 | swami vivekananda | 9.6485 |
| 17 | sri lanka | 9.6263 |
| 18 | vikas swarup | 9.5968 |
| 19 | col niranjan | 9.5850 |
| 20 | bharatiya janata | 9.5348 |
| 21 | jawaharlal nehru | 9.5232 |
| 22 | kapil sibal | 9.5204 |
| 23 | passive euthanasia | 9.4870 |
| 24 | jet airway | 9.4738 |
| 25 | suresh prabhu | 9.4452 |
| 26 | madan gopal | 9.3535 |
| 27 | environmental clearance | 9.3484 |
| 28 | swachh bharat | 9.3438 |
| 29 | mamata banerjee | 9.3337 |
| 30 | arab league | 9.3167 |
| 31 | maulana masood | 9.2634 |
| 32 | lie detector | 9.2066 |
| 33 | nitin gadkari | 9.1917 |
| 34 | oommen chandy | 9.1795 |
| 35 | sexual harassment | 9.1737 |
| 36 | cook madan | 9.1616 |
| 37 | venkaiah naidu | 9.1453 |
| 38 | jd u | 9.1411 |
| 39 | sushma swaraj | 9.1361 |
| 40 | ford foundation | 9.0731 |
| 41 | nabam tuki | 9.0266 |
| 42 | masood azhar | 9.0005 |

| S. No. | Bigram | Frequency |
|---|---|---|
| 1 | new delhi | 710 |
| 2 | prime minister | 336 |
| 3 | chief minister | 294 |
| 4 | source said | 237 |
| 5 | r crore | 236 |
| 6 | narendra modi | 223 |
| 7 | supreme court | 176 |
| 8 | minister narendra | 170 |
| 9 | official said | 162 |
| 10 | terror attack | 150 |
| 11 | air force | 147 |
| 12 | state government | 137 |
| 13 | high court | 121 |
| 14 | told toi | 118 |
| 15 | pathankot attack | 106 |
| 16 | republic day | 104 |
| 17 | tamil nadu | 95 |
| 18 | chief justice | 94 |
| 19 | west bengal | 92 |
| 20 | security force | 91 |

| S. No. | Word | Frequency |
|---|---|---|
| 1 | said | 3802 |
| 2 | government | 1702 |
| 3 | india | 1556 |
| 4 | minister | 1343 |
| 5 | state | 1231 |
| 6 | delhi | 1145 |
| 7 | new | 1088 |
| 8 | year | 1073 |
| 9 | party | 814 |
| 10 | court | 775 |
| 11 | attack | 759 |
| 12 | day | 746 |
| 13 | congress | 744 |
| 14 | pakistan | 720 |
| 15 | police | 716 |
| 16 | country | 713 |
| 17 | bjp | 711 |
| 18 | chief | 693 |
| 19 | indian | 666 |
| 20 | people | 618 |

Filters for the table below:
- Min PMI: 6.2
- Min frequency: 30

| S. No. | Bigram | PMI |
|---|---|---|
| 1 | omega fatty | 8.5011 |
| 2 | bone marrow | 8.3798 |
| 3 | fatty acid | 7.9130 |
| 4 | olive oil | 7.4450 |
| 5 | zika virus | 7.4041 |
| 6 | social medium | 7.1637 |
| 7 | daily mirror | 7.1179 |
| 8 | basmati rice | 7.1105 |
| 9 | lucky colour | 6.9773 |
| 10 | brown rice | 6.7669 |
| 11 | dr jenkins | 6.7401 |
| 12 | heart attack | 6.6881 |
| 13 | vitamin d | 6.6828 |
| 14 | home remedy | 6.5736 |
| 15 | blood circulation | 6.5348 |
| 16 | blood pressure | 6.4470 |
| 17 | calorie intake | 6.4232 |
| 18 | vitamin c | 6.3964 |
| 19 | weight gain | 6.3806 |
| 20 | green tea | 6.3775 |
| 21 | weight loss | 6.3195 |
| 22 | long term | 6.2300 |
| 23 | junk food | 6.2265 |

| S. No. | Bigram | Frequency |
|---|---|---|
| 1 | make sure | 148 |
| 2 | weight loss | 100 |
| 3 | blood pressure | 94 |
| 4 | year old | 93 |
| 5 | heart disease | 90 |
| 6 | health benefit | 80 |
| 7 | zika virus | 73 |
| 8 | say dr | 63 |
| 9 | vitamin c | 63 |
| 10 | fatty acid | 62 |
| 11 | new year | 60 |
| 12 | type diabetes | 59 |
| 13 | brown rice | 58 |
| 14 | vitamin d | 58 |
| 15 | new study | 51 |
| 16 | year ago | 50 |
| 17 | social medium | 48 |
| 18 | long term | 46 |
| 19 | dont want | 45 |
| 20 | omega fatty | 44 |

| S. No. | Word | Frequency |
|---|---|---|
| 1 | time | 1252 |
| 2 | make | 1199 |
| 3 | like | 1069 |
| 4 | help | 1037 |
| 5 | say | 997 |
| 6 | people | 918 |
| 7 | body | 915 |
| 8 | year | 873 |
| 9 | said | 835 |
| 10 | day | 812 |
| 11 | food | 782 |
| 12 | skin | 772 |
| 13 | woman | 687 |
| 14 | child | 669 |
| 15 | study | 669 |
| 16 | health | 658 |
| 17 | just | 649 |
| 18 | new | 648 |
| 19 | good | 637 |
| 20 | life | 622 |

Filters for the table below:
- Min PMI: 6
- Min frequency: 30

| S. No. | Bigram | PMI |
|---|---|---|
| 1 | hong kong | 8.9180 |
| 2 | asylum seeker | 8.8153 |
| 3 | hydrogen bomb | 7.4119 |
| 4 | u s | 7.3852 |
| 5 | middle east | 7.3512 |
| 6 | prime minister | 7.1277 |
| 7 | hillary clinton | 7.0899 |
| 8 | john kerry | 7.0754 |
| 9 | saudi arabia | 6.9955 |
| 10 | fox news | 6.9790 |
| 11 | al qaida | 6.9642 |
| 12 | barack obama | 6.9313 |
| 13 | human right | 6.9183 |
| 14 | air strike | 6.6930 |
| 15 | white house | 6.6814 |
| 16 | zika virus | 6.5728 |
| 17 | news agency | 6.5517 |
| 18 | social medium | 6.5504 |
| 19 | told reuters | 6.4148 |
| 20 | told afp | 6.4055 |
| 21 | security council | 6.3453 |
| 22 | donald trump | 6.3247 |
| 23 | told reporter | 6.3221 |
| 24 | president barack | 6.3162 |
| 25 | foreign ministry | 6.2998 |
| 26 | south carolina | 6.2739 |
| 27 | presidential candidate | 6.1728 |
| 28 | new hampshire | 6.0816 |
| 29 | new york | 6.0688 |

| S. No. | Bigram | Frequency |
|---|---|---|
| 1 | united state | 240 |
| 2 | north korea | 235 |
| 3 | saudi arabia | 157 |
| 4 | new york | 156 |
| 5 | islamic state | 144 |
| 6 | official said | 115 |
| 7 | year old | 101 |
| 8 | south korea | 95 |
| 9 | prime minister | 87 |
| 10 | white house | 85 |
| 11 | u s | 79 |
| 12 | human right | 68 |
| 13 | new year | 64 |
| 14 | donald trump | 61 |
| 15 | news agency | 56 |
| 16 | north korean | 55 |
| 17 | new hampshire | 55 |
| 18 | nuclear test | 53 |
| 19 | security force | 52 |
| 20 | hillary clinton | 51 |

| S. No. | Word | Frequency |
|---|---|---|
| 1 | said | 2361 |
| 2 | state | 886 |
| 3 | year | 791 |
| 4 | people | 687 |
| 5 | new | 631 |
| 6 | country | 524 |
| 7 | north | 465 |
| 8 | official | 453 |
| 9 | attack | 449 |
| 10 | group | 447 |
| 11 | government | 425 |
| 12 | time | 413 |
| 13 | president | 413 |
| 14 | china | 410 |
| 15 | trump | 392 |
| 16 | nuclear | 386 |
| 17 | korea | 349 |
| 18 | iran | 330 |
| 19 | told | 326 |
| 20 | force | 317 |

Filters for the table below:
- Min PMI: 9
- Min frequency: 5

| S. No. | Bigram | PMI |
|---|---|---|
| 1 | hero motocorp | 10.6154 |
| 2 | san francisco | 10.6154 |
| 3 | mscis broadest | 10.4331 |
| 4 | gen ze | 10.4331 |
| 5 | grama panchayat | 10.4331 |
| 6 | thomas cook | 10.2789 |
| 7 | jio infocomm | 10.1454 |
| 8 | silicon valley | 10.1454 |
| 9 | texas intermediate | 10.1248 |
| 10 | sukanya samriddhi | 10.0276 |
| 11 | tamil nadu | 9.9222 |
| 12 | rajya sabha | 9.9222 |
| 13 | nirmala sitharaman | 9.9222 |
| 14 | coca cola | 9.8269 |
| 15 | circuit breaker | 9.8269 |
| 16 | narayana hrudayalaya | 9.7399 |
| 17 | saudi arabia | 9.5858 |
| 18 | arundhati bhattacharya | 9.5858 |
| 19 | viral shot | 9.5858 |
| 20 | l ampt | 9.5646 |
| 21 | germany dax | 9.5576 |
| 22 | sq ft | 9.5168 |
| 23 | patanjali ayurved | 9.5168 |
| 24 | infinite analytics | 9.4905 |
| 25 | losing streak | 9.3344 |
| 26 | blue chip | 9.2664 |
| 27 | hang seng | 9.2291 |
| 28 | raw material | 9.2291 |
| 29 | jan dhan | 9.2291 |
| 30 | angel broking | 9.2291 |
| 31 | morgan stanley | 9.1803 |
| 32 | jp morgan | 9.1803 |
| 33 | intermediate wti | 9.1803 |
| 34 | somnath temple | 9.1521 |
| 35 | dedicated freight | 9.1490 |
| 36 | app click | 9.1468 |
| 37 | mercedes benz | 9.1338 |
| 38 | dual mode | 9.1338 |
| 39 | bharti airtel | 9.1209 |
| 40 | poll conducted | 9.1158 |
| 41 | aditya birla | 9.1113 |
| 42 | kongs hang | 9.0956 |
| 43 | shree cement | 9.0893 |
| 44 | sun pharma | 9.0749 |
| 45 | generic medicine | 9.0749 |
| 46 | latin america | 9.0551 |
| 47 | dhan yojana | 9.0468 |
| 48 | freight corridor | 9.0313 |
| 49 | electrified route | 9.0059 |

| S. No. | Bigram | Frequency |
|---|---|---|
| 1 | r crore | 443 |
| 2 | new delhi | 158 |
| 3 | r lakh | 98 |
| 4 | oil price | 92 |
| 5 | stock market | 78 |
| 6 | u s | 67 |
| 7 | year ago | 63 |
| 8 | early trade | 60 |
| 9 | central bank | 59 |
| 10 | s ampp | 58 |
| 11 | lakh crore | 55 |
| 12 | net profit | 49 |
| 13 | managing director | 49 |
| 14 | long term | 48 |
| 15 | official said | 47 |
| 16 | crude oil | 44 |
| 17 | source said | 43 |
| 18 | mutual fund | 42 |
| 19 | emerging market | 39 |
| 20 | bse sensex | 39 |

| S. No. | Word | Frequency |
|---|---|---|
| 1 | said | 1546 |
| 2 | year | 1105 |
| 3 | india | 997 |
| 4 | market | 898 |
| 5 | company | 775 |
| 6 | r | 741 |
| 7 | bank | 644 |
| 8 | crore | 532 |
| 9 | new | 510 |
| 10 | government | 500 |
| 11 | growth | 476 |
| 12 | cent | 420 |
| 13 | price | 411 |
| 14 | investor | 410 |
| 15 | investment | 378 |
| 16 | global | 360 |
| 17 | rate | 353 |
| 18 | time | 351 |
| 19 | fund | 329 |
| 20 | china | 326 |