Skip to content

Commit

Permalink
Merge pull request #1 from djokester/master
Browse files Browse the repository at this point in the history
Added Lexical Dispersion Plot
  • Loading branch information
kylepjohnson committed Jun 5, 2017
2 parents df16632 + 67b6060 commit ed7bb85
Showing 1 changed file with 159 additions and 0 deletions.
159 changes: 159 additions & 0 deletions 9 Lexical Dispersion Plot.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,159 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Creating a Lexical Dispersion Plot\n",
"\n",
"Given a piece of text, and a list of words, this a lexical dispersion plot locates the occurrence of each of the words in the given text."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# import statements\n",
"import cltk\n",
"from nltk.tokenize import word_tokenize \n",
"from cltk.tokenize.word import WordTokenizer\n",
"from cltk.tokenize.indian_tokenizer import indian_punctuation_tokenize_regex as i_word\n",
"import matplotlib.pyplot as plt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Parameters \n",
"\n",
"We will define a function by the name of ` dispersionPlot(text, words, lang) ` .\n",
"\n",
"parameter text: A text based on which the Lexical Distribution plot is to be drawn\n",
"\n",
"type of text: string\n",
"\n",
"parameter words: A list of words, whose Distribution across the text is plotted.\n",
"\n",
"type of words: list of strings\n",
"\n",
"parameter lang: The ISO 639-1 Language Code of the language in which the text is based on.\n",
"\n",
"type of lang: string\n",
"\n",
"returns void. "
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def dispersionPlot(text, words, lang):\n",
" languages = [\"en\",\"bn\",\"hi\",\"la\",\"sa\"]\n",
" \"\"\"\n",
" en:English\n",
" bn:Bengali\n",
" hi:Hindi\n",
" la:Latin\n",
" sa:Sanskrit\n",
" \"\"\"\n",
" if lang in languages:\n",
" if lang in [\"en\",\"la\"]:\n",
" tokens = word_tokenize(text.lower())\n",
" for i in range(0,len(words)):\n",
" words[i] = words[i].lower()\n",
" if lang in [\"bn\",\"hi\",\"sa\"]:\n",
" tokens= i_word(text)\n",
" \n",
" # Locating the matches of the words in the text. \n",
" x_length = len(tokens)\n",
" y_length = len(words)\n",
" x_list = []\n",
" y_list = []\n",
" for i in range(0,x_length):\n",
" for j in range(0,y_length):\n",
" if tokens[i]==words[j]:\n",
" x_list.append(i+1)\n",
" y_list.append(j)\n",
" \n",
" #Creation of Dispersion Plot with Matplotlib's pyplot. \n",
" plt.plot(x_list, y_list, \"b|\", scalex=.1)\n",
" plt.yticks(list(range(len(words))), words, color=\"b\")\n",
" plt.ylim(-1, len(words))\n",
" plt.xlabel(\"Lexical Distribution\")\n",
" plt.show()\n",
" \n",
" else:\n",
" print(\"Language not presently covered by CLTK or wrong language code\") \n",
" \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Explaination \n",
"\n",
"### Tokenisation \n",
"Firstly we check which language the function is present in. Then we try to sort them accordingly, sending the Indian ones one way and the English, Latin cases the other. Both these groups have been assigned their own separte tokenizer. I have used CLTK's Indian Tokenizer for Indian Languages and NLTK's word_tokenize for the other two languages.\n",
"\n",
"### Locating Matches and Plotting\n",
"Pretty simple task where we take matches from the text and locate their positions in the text in order to plot them on the graph. This is achieved by means of simple looping techniques. It is followed by simple plotting on a graph and manipulating data points to give the Lexical Dispersion Plot. "
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAYYAAAEKCAYAAAAW8vJGAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAD8NJREFUeJzt3X2sZHV9x/H3Z3cVEbQWdzGAhgVro1QLrhdaFRVbrQ8x\ntVoMKq3SaInG57ptoRp3tbaRWrFaTJHSBqNU06i0qKkR0BVrY5e7urALFEWEVEFYH6pgEZfl2z/m\n3Hh/t3v3ztynuTP3/Upu7plzzpz5fs/vZj5zzsw9k6pCkqQpa4ZdgCRpZTEYJEkNg0GS1DAYJEkN\ng0GS1DAYJEkNg0GS1DAYJEkNg0GS1Fg37ALmY/369bVx48ZhlyFJI2PHjh3fq6oN/aw7ksGwceNG\nJicnh12GJI2MJLf0u66nkiRJDYNBktQwGCRJDYNBktQwGCRJDYNBktQwGCRJDYNBktQwGCRJDYNB\nktQwGCRJDYNBktQwGCRJDYNBktQwGCRJDYNBktQwGCRJDYNBktQwGCRJDYNBktQwGCRJDYNBktQw\nGCRJDYNBktQwGCRJDYNBktQwGCRJDYNBktQwGCRJDYNBktQwGCRJDYNBktQwGCRJDYNBktQwGCRJ\nDYNBktQwGCRJDYNBktQwGCRJDYNBktQwGCRJDYNBktQwGCRJDYNBktQwGCRJDYNBktQwGCRJDYNB\nktQwGCRJjaEGQ8KFCcd10382zFokST1DDYYqXlnFdd3NZQ+GrVuX+xEX32L1MA77YhQsxn6eaxuO\n5fIax/2dqlqeBwqHAP8MPBxYC/w58GpgM3Aq8MfALuDaKk4/0LYmJiZqcnJyMWpimdpfMovVwzjs\ni1GwGPt5rm04lstrVPZ3kh1VNdHPust5xPBs4NYqjq/iscBnpxZUcRZwdxUnzBUKkqSltZzBsAt4\nZsI5CU+p4keD3DnJmUkmk0zu2bNniUqUJC1bMFTxdWATvYB4Z8LbBrt/XVBVE1U1sWHDhiWpUZIE\n65brgRKOBH5QxUcS/gd45YxV9ibcr4q9y1WTJOn/W85TSY8DtifsBLYA75yx/ALgmoSLl6ugLVuW\n65GWzmL1MA77YhQsxn6eaxuO5fIax/29bJ9KWkyL9akkSVotVuqnkiRJI8BgkCQ1DAZJUsNgkCQ1\nDAZJUsNgkCQ1DAZJUsNgkCQ1DAZJUsNgkCQ1DAZJUsNgkCQ1DAZJUsNgkCQ1DAZJUsNgkCQ1DAZJ\nUsNgkCQ1DAZJUsNgkCQ1DAZJUsNgkCQ1DAZJUsNgkCQ1DAZJUsNgkCQ1DAZJUsNgkCQ1DAZJUsNg\nkCQ1DAZJUsNgkCQ1DAZJUsNgkCQ1DAZJUsNgkCQ1DAZJUsNgkCQ1DAZJUsNgkCQ1DAZJUsNgkCQ1\nDAZJUsNgkCQ1DAZJUmPOYEj4j+UoRJK0MswZDFU8aTkKWU5bt66s7UjSStLPEcNd3e9TEq5M+EzC\nDQnnJ6xJWJtwUcLuhF0Jb+rW35Yw0U2vT7i5mz4j4V8SLku4OeG1CX+U8LWEryQctoT9AvD2t6+s\n7UjSSrJuwPVPAo4DbgE+C7wQ+BZwVBWPBUh4SB/beSzweOABwI3An1bx+IT3Ai8D/mbAuiRJi2TQ\nN5+3V3FTFfuAjwInAzcBxyb8bcKzgR/3sZ0vVHFnFXuAHwGf6ubvAjbu7w5JzkwymWRyz549A5Yt\nSerXoMFQM29X8UPgeGAb8Crgwm7ZvdO2/4AZ97tn2vR9027fxyxHMVV1QVVNVNXEhg0bBixbktSv\nQYPhpIRjEtYApwH/nrAeWFPFJ4C3Apu6dW8GntBNn7oYxUqSlt6gwXAVcB5wPb33Fi4BjgK2JewE\nPgKc3a3718CrE74GrF+cchfHli0razuStJKkaubZoVlWDKcAm6t43pJW1IeJiYmanJwcdhmSNDKS\n7KiqiX7W9T+fJUmNvj+uWsU2em8wS5LGmEcMkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJahgMkqSG\nwSBJahgMkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJ\nahgMkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJahgM\nkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJavQdDAl3\nLWUhy2Xr1mFXMJ7cr5qLfyMLt1z7MFXV34rhrioO7XPddVXcu6DKDmBiYqImJyfndd8E+mxZA3C/\nai7+jSzcQvZhkh1VNdHPugOfSkpIwrsTdifsSjitm39KwpcSLgWu6+b9XsL2hJ0JH0xY282/q9vG\ntQmXJ5yUsC3hpoTfHrQmSdLimc97DC8ETgCOB54BvDvhiG7ZJuANVfxywmOA04AnV3ECsA84vVvv\nEODzVfwKcCfwTuCZwAuAd+zvQZOcmWQyyeSePXvmUbYkqR/r5nGfk4GPVrEPuD3hi8CJwI+B7VV8\nq1vvN4EnAFclABwM3NEt+xnw2W56F3BPFXsTdgEb9/egVXUBcAH0TiXNo25JUh/mEwwH8pNp0wE+\nVMXZ+1lvbxVTT+73AfcAVHFfsug1SZIGMJ9TSV8CTktYm7ABeCqwfT/rXQGcmnA4QMJhCUfPv9TF\nsWXLsCsYT+5XzcW/kYVbrn04n1fnlwBPBK4GCviTKr6b8OjpK1VxXcJbgc8lrAH2Aq8BbllgzQvi\nR+aWhvtVc/FvZOFW3MdVV5KFfFxVklajJf24qiRpvBkMkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJ\nahgMkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJahgM\nkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJahgMkqSG\nwSBJahgMkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJahgMkqSGwSBJ\nahgMkqSGwSBJahgMkqSGwSBJaqSqhl3DwJLsAW6Zx13XA99b5HJWCnsbTfY2mkaxt6OrakM/K45k\nMMxXksmqmhh2HUvB3kaTvY2mce4NPJUkSZrBYJAkNVZbMFww7AKWkL2NJnsbTePc2+p6j0GSNLfV\ndsQgSZrDqgiGJM9OckOSG5OcNex6FirJzUl2JdmZZLKbd1iSy5J8o/v9i8Ous19J/jHJHUl2T5s3\naz9Jzu7G8oYkzxpO1f2ZpbetSb7Tjd/OJM+dtmwkekvyiCRfSHJdkmuTvKGbPy7jNlt/Iz92famq\nsf4B1gLfBI4F7g9cDRw37LoW2NPNwPoZ8/4KOKubPgs4Z9h1DtDPU4FNwO65+gGO68bwIOCYbmzX\nDruHAXvbCmzez7oj0xtwBLCpm34Q8PWu/nEZt9n6G/mx6+dnNRwxnATcWFU3VdXPgI8Bzx9yTUvh\n+cCHuukPAb8zxFoGUlVXAj+YMXu2fp4PfKyq7qmqbwE30hvjFWmW3mYzMr1V1W1V9dVu+k7geuAo\nxmfcZutvNiPV31xWQzAcBfz3tNvf5sADPAoKuDzJjiRndvMeVlW3ddPfBR42nNIWzWz9jMt4vi7J\nNd2ppqnTLSPZW5KNwOOB/2QMx21GfzBGYzeb1RAM4+jkqjoBeA7wmiRPnb6wese2Y/Nxs3HrB/g7\neqc2TwBuA94z3HLmL8mhwCeAN1bVj6cvG4dx209/YzN2B7IaguE7wCOm3X54N29kVdV3ut93AJfQ\nO2S9PckRAN3vO4ZX4aKYrZ+RH8+qur2q9lXVfcDf8/NTDiPVW5L70XvSvLiqPtnNHptx219/4zJ2\nc1kNwXAV8KgkxyS5P/Bi4NIh1zRvSQ5J8qCpaeC3gN30enp5t9rLgX8dToWLZrZ+LgVenOSgJMcA\njwK2D6G+eZt64uy8gN74wQj1liTAPwDXV9W50xaNxbjN1t84jF1fhv3u93L8AM+l96mCbwJvGXY9\nC+zlWHqffrgauHaqH+ChwBXAN4DLgcOGXesAPX2U3mH5XnrnZl9xoH6At3RjeQPwnGHXP4/ePgzs\nAq6h94RyxKj1BpxM7zTRNcDO7ue5YzRus/U38mPXz4//+SxJaqyGU0mSpAEYDJKkhsEgSWoYDJKk\nhsEgSWoYDFpxkty1CNt4VZKXLebjJ9nXXVHz2iRXJ3lzkjXdsokk7z/ANjcmeekBlh+Z5OPd9BlJ\nzhuw5jOSHDnt9oVJjhtkG9KUdcMuQFoKVXX+Emz27updioQkhwP/BDwY2FJVk8DkAe67EXhpd59G\nknVVdStw6gJqO4PeP1vdClBVr1zAtrTKecSgkZBkQ5JPJLmq+3lyN/99Sd7WTT8ryZVJ1nTXzd/c\nzf+lJJd3r/K/muSRSQ5NckV3e1eSga64W73LkZwJvDY9pyT5dPd4T5t2vf6vdf+p/i7gKd28N3Wv\n8C9N8nngiu6IYve0h3hEkm3d9xps6bbbrJNkc9fnqcAEcHG3/YO7+050672k63F3knOm3f+uJH/R\n7ZevJBn1Cy9qkRgMGhXvA95bVScCvwtc2M0/GzgtydOB9wN/UL3r2Ex3MfCBqjoeeBK9/0T+KfCC\nqtoEPB14T3cZhL5V1U30vu/j8BmLNgOv6Y4ungLcTe+7Cb5UVSdU1Xu79TYBp1bV0/az+ZO6Pn8V\neNHUk/wsdXyc3tHK6d32755a1p1eOgf4DXoXfjsxydSlsA8BvtLtlyuBP+y/e40zg0Gj4hnAeUl2\n0rsUwYOTHFpV/0vvCe0y4Lyq+ub0O3Wv1o+qqksAquqn3X0C/GWSa+hduuEoFu9S5V8Gzk3yeuAh\nVXXvLOtdVlWzfVfDZVX1/e5J/pP0LtEwHycC26pqT1fHxfS+PAjgZ8Cnu+kd9E53Sb7HoJGxBvj1\nqvrpfpY9Dvg+cOR+ls3mdGAD8ISq2pvkZuABgxSU5FhgH70riD5man5VvSvJZ+hdW+fLmf1rHn9y\ngM3PvFZNAffSvpgbqN792Fs/vybOPnw+UMcjBo2KzwGvm7qRZOpN4KOBN9P7IpXnJPm16Xeq3rdv\nfXvq9El39csHAr8A3NGFwtOBowcpJskG4Hx6Ryk1Y9kjq2pXVZ1D7+q+jwbupPcVkf16Znrfn3ww\nvW9B+zJwO3B4kocmOQh43rT1Z9v+duBpSdYnWQu8BPjiAHVoFfIVglaiByb59rTb5wKvBz7QnfpZ\nB1yZ5NX0Lo28uapuTfIK4KIkJ87Y3u8DH0zyDnpXOX0RvVMqn0qyi975+f/qo66Du1NZ96P36v3D\nXW0zvbELm/voXQH337rpfUmuBi4CfjjHY22n910ADwc+0n3qia6H7fSu9T+95ouA85PcDTxxamZV\n3ZbkLOAL9E6ffaaqRv2S7FpiXl1VktTwVJIkqWEwSJIaBoMkqWEwSJIaBoMkqWEwSJIaBoMkqWEw\nSJIa/wfjNdHhbiWeswAAAABJRU5ErkJggg==\n",
"text/plain": [
"<matplotlib.figure.Figure at 0x7f93bf780390>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"#Testing the function\n",
"text = \"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Cras at maximus dui. Sed mauris ipsum, gravida id velit at, lobortis aliquam magna. Nam feugiat nibh eget cursus rutrum. Fusce eu euismod turpis, in posuere elit. In pellentesque massa sit amet sem posuere, vel viverra justo suscipit. Aenean nibh sem, imperdiet nec sem sit amet, maximus euismod velit. Ut vitae ex mauris. Donec laoreet lorem at diam viverra dapibus. Suspendisse elementum rhoncus commodo. Donec massa purus, dignissim maximus laoreet in, pharetra euismod nulla.Nunc eu libero lacus. Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. Morbi eget tincidunt velit. Curabitur a libero vel felis maximus ultrices. Donec porta fringilla purus eget porttitor. In cursus lobortis sapien, sit amet euismod eros semper quis. Fusce luctus eleifend neque, gravida mollis massa fringilla sit amet. Nunc placerat, purus sit amet maximus sollicitudin, sapien sem suscipit elit, non aliquet nunc nisl in arcu.Quisque eu nisi interdum, pretium elit vel, dignissim est. Ut lobortis vehicula lectus, imperdiet tristique lorem pulvinar at. Phasellus leo justo, tempor at maximus a, vehicula et urna. Nunc blandit eros in dui venenatis placerat. Maecenas vehicula neque orci, at tempor elit vehicula et. Integer elementum, diam nec mattis porttitor, risus nibh vehicula quam, sit amet pellentesque quam ante commodo orci. Etiam sed dignissim tellus. Cras non ultrices velit, eget egestas justo. Ut rutrum condimentum lorem, ut auctor massa dictum eu. Morbi dictum eget eros sed varius. Nunc tristique mollis fermentum. Donec vel odio gravida, fringilla ante a, volutpat dolor. Vestibulum facilisis dictum magna id aliquam. Etiam ex ex, ultricies a dignissim vitae, sollicitudin nec orci. Nam eu augue et libero porttitor maximus volutpat a risus. Suspendisse eget mauris et mi tincidunt suscipit.\"\n",
"words = [\"Lorem\", \"ipsum\", \"sit\"]\n",
"dispersionPlot(text, words, \"la\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

0 comments on commit ed7bb85

Please sign in to comment.