Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Merge pull request #1 from djokester/master
Added Lexical Dispersion Plot
Showing 1 changed file with 159 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,159 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Creating a Lexical Dispersion Plot\n", | ||
"\n", | ||
"Given a piece of text and a list of words, a lexical dispersion plot shows where each of the words occurs in the text." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 1, | ||
"metadata": { | ||
"collapsed": true | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"# import statements\n", | ||
"import cltk\n", | ||
"from nltk.tokenize import word_tokenize \n", | ||
"from cltk.tokenize.word import WordTokenizer\n", | ||
"from cltk.tokenize.indian_tokenizer import indian_punctuation_tokenize_regex as i_word\n", | ||
"import matplotlib.pyplot as plt" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Parameters \n", | ||
"\n", | ||
"We will define a function named `dispersionPlot(text, words, lang)`.\n", | ||
"\n", | ||
"parameter text: The text for which the lexical dispersion plot is drawn.\n", | ||
"\n", | ||
"type of text: string\n", | ||
"\n", | ||
"parameter words: A list of words whose positions in the text are plotted.\n", | ||
"\n", | ||
"type of words: list of strings\n", | ||
"\n", | ||
"parameter lang: The ISO 639-1 language code of the text. Supported codes are `en`, `bn`, `hi`, `la` and `sa`.\n", | ||
"\n", | ||
"type of lang: string\n", | ||
"\n", | ||
"returns: None. The plot is displayed with `plt.show()`." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 2, | ||
"metadata": { | ||
"collapsed": false | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"def dispersionPlot(text, words, lang):\n", | ||
"    \"\"\"\n", | ||
"    Supported language codes:\n", | ||
"    en:English\n", | ||
"    bn:Bengali\n", | ||
"    hi:Hindi\n", | ||
"    la:Latin\n", | ||
"    sa:Sanskrit\n", | ||
"    \"\"\"\n", | ||
"    languages = [\"en\", \"bn\", \"hi\", \"la\", \"sa\"]\n", | ||
"    if lang in languages:\n", | ||
"        if lang in [\"en\", \"la\"]:\n", | ||
"            # Match case-insensitively; build a new list so the caller's list is not modified.\n", | ||
"            tokens = word_tokenize(text.lower())\n", | ||
"            words = [word.lower() for word in words]\n", | ||
"        if lang in [\"bn\", \"hi\", \"sa\"]:\n", | ||
"            tokens = i_word(text)\n", | ||
"\n", | ||
"        # Locating the matches of the words in the text.\n", | ||
"        x_length = len(tokens)\n", | ||
"        y_length = len(words)\n", | ||
"        x_list = []  # 1-based token positions of the matches\n", | ||
"        y_list = []  # index of the matched word in `words`\n", | ||
"        for i in range(0, x_length):\n", | ||
"            for j in range(0, y_length):\n", | ||
"                if tokens[i] == words[j]:\n", | ||
"                    x_list.append(i + 1)\n", | ||
"                    y_list.append(j)\n", | ||
"\n", | ||
"        # Creation of Dispersion Plot with Matplotlib's pyplot.\n", | ||
"        plt.plot(x_list, y_list, \"b|\", scalex=.1)\n", | ||
"        plt.yticks(list(range(len(words))), words, color=\"b\")\n", | ||
"        plt.ylim(-1, len(words))\n", | ||
"        plt.xlabel(\"Lexical Distribution\")\n", | ||
"        plt.show()\n", | ||
"\n", | ||
"    else:\n", | ||
"        print(\"Language not presently covered by CLTK or wrong language code\")\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Explanation \n", | ||
"\n", | ||
"### Tokenisation \n", | ||
"First we check which language the text is in. The Indian languages (Bengali, Hindi and Sanskrit) go to CLTK's Indian tokenizer, while English and Latin go to NLTK's `word_tokenize`. For English and Latin, both the text and the word list are lowercased so that matching is case-insensitive.\n", | ||
"\n", | ||
"### Locating Matches and Plotting\n", | ||
"We loop over the tokens and, whenever a token equals one of the given words, record its 1-based position in the text (x) together with the index of that word in the list (y). These points are then drawn with Matplotlib as vertical tick marks, one row per word, which gives the lexical dispersion plot. " | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 3, | ||
"metadata": { | ||
"collapsed": false | ||
}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAYYAAAEKCAYAAAAW8vJGAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAD8NJREFUeJzt3X2sZHV9x/H3Z3cVEbQWdzGAhgVro1QLrhdaFRVbrQ8x\ntVoMKq3SaInG57ptoRp3tbaRWrFaTJHSBqNU06i0qKkR0BVrY5e7urALFEWEVEFYH6pgEZfl2z/m\n3Hh/t3v3ztynuTP3/Upu7plzzpz5fs/vZj5zzsw9k6pCkqQpa4ZdgCRpZTEYJEkNg0GS1DAYJEkN\ng0GS1DAYJEkNg0GS1DAYJEkNg0GS1Fg37ALmY/369bVx48ZhlyFJI2PHjh3fq6oN/aw7ksGwceNG\nJicnh12GJI2MJLf0u66nkiRJDYNBktQwGCRJDYNBktQwGCRJDYNBktQwGCRJDYNBktQwGCRJDYNB\nktQwGCRJDYNBktQwGCRJDYNBktQwGCRJDYNBktQwGCRJDYNBktQwGCRJDYNBktQwGCRJDYNBktQw\nGCRJDYNBktQwGCRJDYNBktQwGCRJDYNBktQwGCRJDYNBktQwGCRJDYNBktQwGCRJDYNBktQwGCRJ\nDYNBktQwGCRJDYNBktQwGCRJDYNBktQwGCRJDYNBktQwGCRJDYNBktQwGCRJDYNBktQwGCRJDYNB\nktQwGCRJjaEGQ8KFCcd10382zFokST1DDYYqXlnFdd3NZQ+GrVuX+xEX32L1MA77YhQsxn6eaxuO\n5fIax/2dqlqeBwqHAP8MPBxYC/w58GpgM3Aq8MfALuDaKk4/0LYmJiZqcnJyMWpimdpfMovVwzjs\ni1GwGPt5rm04lstrVPZ3kh1VNdHPust5xPBs4NYqjq/iscBnpxZUcRZwdxUnzBUKkqSltZzBsAt4\nZsI5CU+p4keD3DnJmUkmk0zu2bNniUqUJC1bMFTxdWATvYB4Z8LbBrt/XVBVE1U1sWHDhiWpUZIE\n65brgRKOBH5QxUcS/gd45YxV9ibcr4q9y1WTJOn/W85TSY8DtifsBLYA75yx/ALgmoSLl6ugLVuW\n65GWzmL1MA77YhQsxn6eaxuO5fIax/29bJ9KWkyL9akkSVotVuqnkiRJI8BgkCQ1DAZJUsNgkCQ1\nDAZJUsNgkCQ1DAZJUsNgkCQ1DAZJUsNgkCQ1DAZJUsNgkCQ1DAZJUsNgkCQ1DAZJUsNgkCQ1DAZJ\nUsNgkCQ1DAZJUsNgkCQ1DAZJUsNgkCQ1DAZJUsNgkCQ1DAZJUsNgkCQ1DAZJUsNgkCQ1DAZJUsNg\nkCQ1DAZJUsNgkCQ1DAZJUsNgkCQ1DAZJUsNgkCQ1DAZJUsNgkCQ1DAZJUsNgkCQ1DAZJUsNgkCQ1\nDAZJUsNgkCQ1DAZJUmPOYEj4j+UoRJK0MswZDFU8aTkKWU5bt66s7UjSStLPEcNd3e9TEq5M+EzC\nDQnnJ6xJWJtwUcLuhF0Jb+rW35Yw0U2vT7i5mz4j4V8SLku4OeG1CX+U8LWEryQctoT9AvD2t6+s\n7UjSSrJuwPVPAo4DbgE+C7wQ+BZwVBWPBUh4SB/beSzweOABwI3An1bx+IT3Ai8D/mbAuiRJi2TQ\nN5+3V3FTFfuAjwInAzcBxyb8bcKzgR/3sZ0vVHFnFXuAHwGf6ubvAjbu7w5JzkwymWRyz549A5Yt\nSerXoMFQM29X8UPgeGAb8Crgwm7ZvdO2/4AZ97tn2vR9027fxyxHMVV1QVVNVNXEhg0bBixbktSv\nQYPhpIRjEtYApwH/nrAeWFPFJ4C3Apu6dW8GntBNn7oYxUqSlt6gwXAVcB5wPb33Fi4BjgK2JewE\nPgKc3a3718CrE74GrF+cchfHli0razuStJKkaubZoVlWDKcAm6t43pJW1IeJiYmanJwcdhmSNDKS\n7KiqiX7W9T+fJUmNvj+uWsU2em8wS5LGmEcM
kqSGwSBJahgMkqSGwSBJahgMkqSGwSBJahgMkqSG\nwSBJahgMkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJ\nahgMkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJahgM\nkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJavQdDAl3\nLWUhy2Xr1mFXMJ7cr5qLfyMLt1z7MFXV34rhrioO7XPddVXcu6DKDmBiYqImJyfndd8E+mxZA3C/\nai7+jSzcQvZhkh1VNdHPugOfSkpIwrsTdifsSjitm39KwpcSLgWu6+b9XsL2hJ0JH0xY282/q9vG\ntQmXJ5yUsC3hpoTfHrQmSdLimc97DC8ETgCOB54BvDvhiG7ZJuANVfxywmOA04AnV3ECsA84vVvv\nEODzVfwKcCfwTuCZwAuAd+zvQZOcmWQyyeSePXvmUbYkqR/r5nGfk4GPVrEPuD3hi8CJwI+B7VV8\nq1vvN4EnAFclABwM3NEt+xnw2W56F3BPFXsTdgEb9/egVXUBcAH0TiXNo25JUh/mEwwH8pNp0wE+\nVMXZ+1lvbxVTT+73AfcAVHFfsug1SZIGMJ9TSV8CTktYm7ABeCqwfT/rXQGcmnA4QMJhCUfPv9TF\nsWXLsCsYT+5XzcW/kYVbrn04n1fnlwBPBK4GCviTKr6b8OjpK1VxXcJbgc8lrAH2Aq8BbllgzQvi\nR+aWhvtVc/FvZOFW3MdVV5KFfFxVklajJf24qiRpvBkMkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJ\nahgMkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJahgM\nkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJahgMkqSG\nwSBJahgMkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJ\nahgMkqSGwSBJahgMkqSGwSBJaqSqhl3DwJLsAW6Zx13XA99b5HJWCnsbTfY2mkaxt6OrakM/K45k\nMMxXksmqmhh2HUvB3kaTvY2mce4NPJUkSZrBYJAkNVZbMFww7AKWkL2NJnsbTePc2+p6j0GSNLfV\ndsQgSZrDqgiGJM9OckOSG5OcNex6FirJzUl2JdmZZLKbd1iSy5J8o/v9i8Ous19J/jHJHUl2T5s3\naz9Jzu7G8oYkzxpO1f2ZpbetSb7Tjd/OJM+dtmwkekvyiCRfSHJdkmuTvKGbPy7jNlt/Iz92famq\nsf4B1gLfBI4F7g9cDRw37LoW2NPNwPoZ8/4KOKubPgs4Z9h1DtDPU4FNwO65+gGO68bwIOCYbmzX\nDruHAXvbCmzez7oj0xtwBLCpm34Q8PWu/nEZt9n6G/mx6+dnNRwxnATcWFU3VdXPgI8Bzx9yTUvh\n+cCHuukPAb8zxFoGUlVXAj+YMXu2fp4PfKyq7qmqbwE30hvjFWmW3mYzMr1V1W1V9dVu+k7geuAo\nxmfcZutvNiPV31xWQzAcBfz3tNvf5sADPAoKuDzJjiRndvMeVlW3ddPfBR42nNIWzWz9jMt4vi7J\nNd2ppqnTLSPZW5KNwOOB/2QMx21GfzBGYzeb1RAM4+jkqjoBeA7wmiRPnb6wese2Y/Nxs3HrB/g7\neqc2TwBuA94z3HLmL8mhwCeAN1bVj6cvG4dx209/YzN2B7IaguE7wCOm3X54N29kVdV3ut93AJfQ\nO2S9PckRAN3vO4ZX4aKYrZ+RH8+qur2q9lXVfcDf8/NTDiPVW5L70XvSvLiqPtnNHptx219/4zJ2\nc1kNwXAV
8KgkxyS5P/Bi4NIh1zRvSQ5J8qCpaeC3gN30enp5t9rLgX8dToWLZrZ+LgVenOSgJMcA\njwK2D6G+eZt64uy8gN74wQj1liTAPwDXV9W50xaNxbjN1t84jF1fhv3u93L8AM+l96mCbwJvGXY9\nC+zlWHqffrgauHaqH+ChwBXAN4DLgcOGXesAPX2U3mH5XnrnZl9xoH6At3RjeQPwnGHXP4/ePgzs\nAq6h94RyxKj1BpxM7zTRNcDO7ue5YzRus/U38mPXz4//+SxJaqyGU0mSpAEYDJKkhsEgSWoYDJKk\nhsEgSWoYDFpxkty1CNt4VZKXLebjJ9nXXVHz2iRXJ3lzkjXdsokk7z/ANjcmeekBlh+Z5OPd9BlJ\nzhuw5jOSHDnt9oVJjhtkG9KUdcMuQFoKVXX+Emz27updioQkhwP/BDwY2FJVk8DkAe67EXhpd59G\nknVVdStw6gJqO4PeP1vdClBVr1zAtrTKecSgkZBkQ5JPJLmq+3lyN/99Sd7WTT8ryZVJ1nTXzd/c\nzf+lJJd3r/K/muSRSQ5NckV3e1eSga64W73LkZwJvDY9pyT5dPd4T5t2vf6vdf+p/i7gKd28N3Wv\n8C9N8nngiu6IYve0h3hEkm3d9xps6bbbrJNkc9fnqcAEcHG3/YO7+050672k63F3knOm3f+uJH/R\n7ZevJBn1Cy9qkRgMGhXvA95bVScCvwtc2M0/GzgtydOB9wN/UL3r2Ex3MfCBqjoeeBK9/0T+KfCC\nqtoEPB14T3cZhL5V1U30vu/j8BmLNgOv6Y4ungLcTe+7Cb5UVSdU1Xu79TYBp1bV0/az+ZO6Pn8V\neNHUk/wsdXyc3tHK6d32755a1p1eOgf4DXoXfjsxydSlsA8BvtLtlyuBP+y/e40zg0Gj4hnAeUl2\n0rsUwYOTHFpV/0vvCe0y4Lyq+ub0O3Wv1o+qqksAquqn3X0C/GWSa+hduuEoFu9S5V8Gzk3yeuAh\nVXXvLOtdVlWzfVfDZVX1/e5J/pP0LtEwHycC26pqT1fHxfS+PAjgZ8Cnu+kd9E53Sb7HoJGxBvj1\nqvrpfpY9Dvg+cOR+ls3mdGAD8ISq2pvkZuABgxSU5FhgH70riD5man5VvSvJZ+hdW+fLmf1rHn9y\ngM3PvFZNAffSvpgbqN792Fs/vybOPnw+UMcjBo2KzwGvm7qRZOpN4KOBN9P7IpXnJPm16Xeq3rdv\nfXvq9El39csHAr8A3NGFwtOBowcpJskG4Hx6Ryk1Y9kjq2pXVZ1D7+q+jwbupPcVkf16Znrfn3ww\nvW9B+zJwO3B4kocmOQh43rT1Z9v+duBpSdYnWQu8BPjiAHVoFfIVglaiByb59rTb5wKvBz7QnfpZ\nB1yZ5NX0Lo28uapuTfIK4KIkJ87Y3u8DH0zyDnpXOX0RvVMqn0qyi975+f/qo66Du1NZ96P36v3D\nXW0zvbELm/voXQH337rpfUmuBi4CfjjHY22n910ADwc+0n3qia6H7fSu9T+95ouA85PcDTxxamZV\n3ZbkLOAL9E6ffaaqRv2S7FpiXl1VktTwVJIkqWEwSJIaBoMkqWEwSJIaBoMkqWEwSJIaBoMkqWEw\nSJIa/wfjNdHhbiWeswAAAABJRU5ErkJggg==\n", | ||
"text/plain": [ | ||
"<matplotlib.figure.Figure at 0x7f93bf780390>" | ||
] | ||
}, | ||
"metadata": {}, | ||
"output_type": "display_data" | ||
} | ||
], | ||
"source": [ | ||
"#Testing the function\n", | ||
"text = \"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Cras at maximus dui. Sed mauris ipsum, gravida id velit at, lobortis aliquam magna. Nam feugiat nibh eget cursus rutrum. Fusce eu euismod turpis, in posuere elit. In pellentesque massa sit amet sem posuere, vel viverra justo suscipit. Aenean nibh sem, imperdiet nec sem sit amet, maximus euismod velit. Ut vitae ex mauris. Donec laoreet lorem at diam viverra dapibus. Suspendisse elementum rhoncus commodo. Donec massa purus, dignissim maximus laoreet in, pharetra euismod nulla.Nunc eu libero lacus. Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. Morbi eget tincidunt velit. Curabitur a libero vel felis maximus ultrices. Donec porta fringilla purus eget porttitor. In cursus lobortis sapien, sit amet euismod eros semper quis. Fusce luctus eleifend neque, gravida mollis massa fringilla sit amet. Nunc placerat, purus sit amet maximus sollicitudin, sapien sem suscipit elit, non aliquet nunc nisl in arcu.Quisque eu nisi interdum, pretium elit vel, dignissim est. Ut lobortis vehicula lectus, imperdiet tristique lorem pulvinar at. Phasellus leo justo, tempor at maximus a, vehicula et urna. Nunc blandit eros in dui venenatis placerat. Maecenas vehicula neque orci, at tempor elit vehicula et. Integer elementum, diam nec mattis porttitor, risus nibh vehicula quam, sit amet pellentesque quam ante commodo orci. Etiam sed dignissim tellus. Cras non ultrices velit, eget egestas justo. Ut rutrum condimentum lorem, ut auctor massa dictum eu. Morbi dictum eget eros sed varius. Nunc tristique mollis fermentum. Donec vel odio gravida, fringilla ante a, volutpat dolor. Vestibulum facilisis dictum magna id aliquam. Etiam ex ex, ultricies a dignissim vitae, sollicitudin nec orci. Nam eu augue et libero porttitor maximus volutpat a risus. Suspendisse eget mauris et mi tincidunt suscipit.\"\n", | ||
"words = [\"Lorem\", \"ipsum\", \"sit\"]\n", | ||
"dispersionPlot(text, words, \"la\")" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.6.0" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 2 | ||
} |
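
The matching step described under "Locating Matches and Plotting" can also be sketched without the nested loop. This is only an illustration, not part of the commit: `locate_matches` and `simple_tokenize` are hypothetical names, the regex tokenizer is a stand-in for the NLTK/CLTK tokenizers used in the notebook, and the sketch assumes the words in the list are unique.

```python
import re


def simple_tokenize(text):
    """Hypothetical stand-in tokenizer: lowercased words and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def locate_matches(tokens, words):
    """Return (x_list, y_list): 1-based token positions and matched word indices.

    Assumes `words` has no duplicates; each token costs one dict lookup
    instead of a scan over the whole word list.
    """
    row_of = {word: row for row, word in enumerate(words)}
    x_list, y_list = [], []
    for position, token in enumerate(tokens, start=1):
        row = row_of.get(token)
        if row is not None:
            x_list.append(position)
            y_list.append(row)
    return x_list, y_list


tokens = simple_tokenize("Lorem ipsum dolor sit amet, lorem sit.")
x_list, y_list = locate_matches(tokens, ["lorem", "ipsum", "sit"])
print(x_list)  # [1, 2, 4, 7, 8]
print(y_list)  # [0, 1, 2, 0, 2]
```

The two lists have the same meaning as `x_list` and `y_list` in the notebook, so they could be passed to `plt.plot(x_list, y_list, "b|")` in the same way.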