@@ -0,0 +1,316 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exploring Tf-IDF\n",
"\n",
"In this notebook you will be exploring the computation of the Tf-IDF feature using a very popular dataset called 20 newsgroups.\n",
"\n",
"The resources you should use to complete this notebook are:\n",
"1. https://en.wikipedia.org/wiki/Tf%E2%80%93idf\n",
"2. http://www.tfidf.com/"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"the 20 newsgroups by date dataset\n",
"Number of posts 11314\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/yuzhong/anaconda2/lib/python2.7/site-packages/matplotlib/font_manager.py:273: UserWarning: Matplotlib is building the font cache using fc-list. This may take a moment.\n",
" warnings.warn('Matplotlib is building the font cache using fc-list. This may take a moment.')\n"
]
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAYkAAAEPCAYAAAC3NDh4AAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAGKpJREFUeJzt3X2wXXV97/H3JzxdQImoJWmJAioXg2NVBoNetEatIFUT\nLlMjjjoo1vEOtjJXb4fEFhPH3ip11EtV2mKRxg6KkSuCD5WIcKzeKmBBQYK5qb2hyJCDVmt5UErM\n9/6x15GdmHWyzz7ZZ+998n7N7Mna66zfWt+zs8/+7N96+K1UFZIk7c6CYRcgSRpdhoQkqZUhIUlq\nZUhIkloZEpKkVoaEJKnVwEMiycIkn05yR5Lbk5yU5PAkG5NsTnJNkoVdy69JsqVZ/pRB1ydJajcX\nPYkLgS9W1VLgGcD3gNXAtVV1HHAdsAYgyfHAKmApcBpwUZLMQY2SpN0YaEgkOQx4flVdClBV26vq\np8BKYH2z2Hrg9GZ6BXB5s9xWYAuwbJA1SpLaDboncQzwoySXJrk5ycVJDgEWVdUkQFVtA45olj8S\nuKur/d3NPEnSEAw6JPYHTgA+UlUnAA/Q2dW061ggjg0iSSNo/wGv/wfAXVX1reb5/6YTEpNJFlXV\nZJLFwL3Nz+8GntDVfkkzbydJDBVJ6kNVzeg470B7Es0upbuS/Odm1ouB24Grgdc3884CrmqmrwbO\nTHJgkmOApwA3tqzbx156rF27dug1zOTRvAP6fBy0N97Ze3isHdj2Fy06auiv/9z+3+36Wvq3P/vX\nf2YG3ZMAeCtwWZIDgH8G3gDsB2xIcjZwJ50zmqiqTUk2AJuAh4Fzqt/fTK0WLz6ayck7d5r3rne9\nq+f2ixYdxbZtW/fq9ufOQ8xu7+ZsT7ab3fYnJ2e3/dm+9gsWHMKOHQ/OqgaNl4zjZ3CSoWbHsP/Q\nZvsh3TmruPv1W9c8el5D399Kdr/9Ga9hFu3nYtvraH89Z7v9/0QnaGZjWK99P+3XsfNrObv33r4u\nCTXD3U1z0ZOYdzoB0f8bdceO2f2hzfbb5K9avpfXt69bPsB1D7snNNeWD7uAfZ49if62z7C/jQ33\nm/w4f5sd/v+d7Yf33t/X9dOT2CfHblq8+GiS9P3Q1LfZfh+SxsU+2ZMYhZ7AePckxrn9ONdue3sS\ns+MxiX3GQfZoJM0JQ2Is7WsHLyUNyz55TEKS1BtDQpLUypCQJLUyJCTtM2Z7+vvixUcP+1eYc54C\n298abD+27ce5dtuPwunf4/iZOcWL6SRJe5UhIUlqZUhIkloZEpKkVoaEJKmVw3JIGiOOWzbXDAlJ\nY8Rxy+aau5skSa0MCUlSK0NCktTKkJAktTIkJEmtDAlJUitDQpLUypCQJLUyJCRJrQwJSVKrgYdE\nkq1JvpPkliQ3NvMOT7IxyeYk1yRZ2LX8miRbktyR5JRB1ydJajcXPYkdwPKqelZVLWvmrQaurarj\ngOuANQBJjgdWAUuB04CL4mhekjQ0cxES2c12VgLrm+n1wOnN9Arg8qraXlVbgS3AMiRJQzEXIVHA\nl5PclOT3mnmLqmoSoKq2AUc0848E7upqe3czT5I0BHMxVPjJVXVPkl8DNibZzK+O9TubsX8lSQMy\n8JCoqnuaf3+Y5LN0dh9NJllUVZNJFgP3NovfDTyhq/mSZt6vWLdu3S+nly9fzvLly/d+8ZI0xiYm\nJpiYmJjVOlI1uC/xSQ4BFlTV/UkOBTYC7wJeDPy4qi5Ich5weFWtbg5cXwacRGc305eBY2uXIpPs\nOmumdTH7G5fYfjzbj3Ptth+F9oP8zBy0JFTVjE4GGnRPYhFwZZJqtnVZVW1M8i1gQ5KzgTvpnNFE\nVW1KsgHYBDwMnDOrNJAkzcpAexKDYk/C9vYkbD+s9uP4mTmln56EV1xLkloZEpKkVnNxCuxAfOUr\nXxl2CZI0743tMYmFC1/UV9vt2/+VBx74DsPer2l7j0nYfjzbj+Nn5pR+jkmMbUj0/x+9ETiVYb/R\nbG9I2H4824/jZ+YUD1xLkvYqQ0KS1MqQkCS1MiQkSa0MCUnq2UEk6euxePHRwy6+L2N7nYQkzb2H\n6PfsqMnJ8bzJpj0JSVIrQ0KS1MqQkCS1MiQkSa0MCUlSK0NCktTKkJAktTIkJEmtDAlJUitDQpLU\nypCQJLUyJCRJrQwJSVIrQ0KS1MqQkCS1MiQkSa0MCUlSK0NCktRqTkIiyYIkNye5unl+eJKNSTYn\nuSbJwq5l1yTZkuSOJKfMRX2SpN2bq57EucCmruergWur6jjgOmANQJLjgVXAUuA04KIk43ljWEma\nB/YYEklemeTRzfQfJ/lMkhN63UCSJcDvAH/dNXslsL6ZXg+c3kyvAC6vqu1VtRXYAizrdVuSpL2r\nl57E+VV1X5LnAb8NXAL8xQy28UHgD4HqmreoqiYBqmobcEQz/0jgrq7l7m7mSZKGYP8elvlF8+/L\ngIur6gtJ/qSXlSd5GTBZVd9OsnyaRWuan7VY1zW9vHlIkqZMTEwwMTExq3WkavrP5ySfp/ON/iXA\nCcDPgBur6hl7XHnyp8Brge3AwcCjgSuBE4HlVTWZZDFwfVUtTbIaqKq6oGn/JWBtVd2wy3qrr1wB\nYCNwKv23B4jtx7b9ONdu+/FuH/b0eTtoSaiqGR3n7WV30yrgGuDUqvo34LF0dh/tUVW9o6qeWFVP\nAs4Erquq1wGfA17fLHYWcFUzfTVwZpIDkxwDPAW4sddfRpK0d/USEn9VVZ+pqi0AVXUP8LpZbve9\nwEuSbAZe3DynqjYBG+icCfVF4JwadvRK0j6sl91NN1fVCV3P9wNuq6rjB13cNDW5u8n2Y7ht2+/b\n7efZ7qbmorb7gN9M8u/N4z7gXh7ZPSRJmsd66Um8p6rWzFE9PbEnYXt7ErYfv/bzrCfR5fNJDm02\n8NokH0hyVF8VSpLGSi8h8RfAg0meAbwd+D7w8YFWJUkaCb2ExPbmDKOVwIer6iN0rneQJPXsIJL0\n/Vi8+OihVN3LFdf3JVlD57TX5ydZABww2LIkab55iNkcD5mcHM5Yp730JF5F57c7uxlnaQnwvoFW\nJUkaCXsMiSYYLgMWJnk58POq8piEJO0DehkqfBWdoTFeSWeIjhuS/O6gC5MkDV8vxyT+CHh2Vd0L\nkOTXgGuBKwZZmCRp+Ho5JrFgKiAa/9pjO0nSmOulJ/GlJNcAn2yev4rO4HuSpHlujyFRVX+Y5Azg\nec2si6vqysGWJUkaBdOGRJLT6dzT4baqetvclCRJGhXTjQJ7EfDfgccB705y/pxVJUkaCdP1JH4L\neEZV/SLJIcDXgHfPTVmSpFEw3VlK/1FVvwCoqgfpjJErSdqHTNeTeGqSW5vpAE9ungeoqvrNgVcn\nSRqq6UJi6ZxVIUkaSa0hUVV3zmUhkqTR45XTkqRWhoQkqdV010l8pfn3grkrR5I0SqY7cP3rSf4L\nsCLJ5exyCmxV3TzQyiRJQzddSLwTOJ/Oneg+sMvPCnjRoIqSJI2G6c5uugK4Isn5VeWV1pK0D+pl\nFNh3J1lBZ5gOgImq+vxgy5IkjYJebl/6HuBcYFPzODfJnw66MEnS8PVyCuzLgJdU1ceq6mPAS4GX\n97LyJAcluSHJLUluS7K2mX94ko1JNie5JsnCrjZrkmxJckeSU/r5pSRJe0ev10k8pmt6YetSu6iq\nh4AXVtWzgGcCpyVZBqwGrq2q44DrgDUASY4HVtEZEuQ04KIkDiwoSUPSS0i8B7glyd8kWQ/8I/A/\ne91AM4IswEF0joEUsBJY38xfD5zeTK8ALq+q7VW1FdgCLOt1W5KkvauXA9efTDIBPLuZdV5Vbet1\nA0kW0AmWJwMfqaqbkiyqqslm/duSHNEsfiTwja7mdzfzJElDsMeQAKiqe4Cr+9lAVe0AnpXkMODK\nJE+j05vYabGZr3ld1/Ty5iFJmjIxMcHExMSs1pGqPj6f+91Y5xaoDwK/Byyvqskki4Hrq2ppktV0\n7lVxQbP8l4C1VXXDLuupvnIFgI3AqfTfHppbath+LNuPc+22H+/2s9/2bD+vk1BVMzrOO9AB/pI8\nfurMpSQHAy8B7qDTK3l9s9hZwFXN9NXAmUkOTHIM8BTgxkHWKElqN+3upiT7AbdX1VP7XP+vA+ub\n4xILgE9V1ReTfBPYkORs4E46ZzRRVZuSbKBzPcbDwDk1l10dSdJO9ri7KclVwB9U1b/MTUl75u4m\n27u7yfbj1348dzf1cuD6cOD2JDcCD0zNrKoVM6xPkjRmegmJ8wdehSRpJPVyncRXkxwFHFtV1yY5\nBNhv8KVJkoatlwH+3gRcAfxVM+tI4LODLEqSNBp6OQX2LcDJwL8DVNUW4IhpW0iS5oVeQuKhqvqP\nqSdJpsZfkiTNc72ExFeTvAM4OMlLgE8DnxtsWZKkUdBLSKwGfgjcBrwZ+CLwx4MsSpI0Gno5u2lH\nM0T4DXR2M232KmhJ2jfsMSSSvAz4S+D7dC4ZPCbJm6vq7wZdnCRpuHq5mO79dO4u908ASZ4MfAEw\nJCRpnuvlmMR9UwHR+GfgvgHVI0kaIa09iSRnNJPfSvJFYAOdYxKvBG6ag9okSUM23e6mV3RNTwIv\naKZ/CBw8sIokSSOjNSSq6g1zWYgkafT0cnbTMcAfAEd3L+9Q4ZI0//VydtNngUvoXGW9Y7DlSJJG\nSS8h8fOq+vOBVyJJGjm9hMSFSdbSue/nQ1Mzq+rmgVUlSRoJvYTE04HXAS/ikd1N1TyXJM1jvYTE\nK4EndQ8XLknaN/RyxfV3gccMuhBJ0ujppSfxGOB7SW5i52MSngIrSfNcLyGxduBVSJJGUi/3k/jq\nXBQiSRo9vVxxfR+P3NP6QOAA4IGqOmyQhUmShq+XnsSjp6aTBFgJPGeQRUmSRkMvZzf9UnV8Fjh1\nQPVIkkZIL7ubzuh6ugA4Efh5LytPsgT4OLCIzoV4H62qP09yOPAp4ChgK7Cqqn7atFkDnA1sB86t\nqo09/zaSpL2ql7Obuu8rsZ3Oh/rKHte/HXhbVX07yaOAf0yyEXgDcG1V/VmS84A1wOokxwOrgKXA\nEuDaJMdWVbVtQJI0OL0ck+j7vhJVtQ3Y1kzfn+QOOh/+K3nkJkbrgQlgNbACuLyqtgNbk2wBlgE3\n9FuDJKl/092+9J3TtKuqevdMNpTkaOCZwDeBRVU12axoW5IjmsWOBL7R1ezuZp4kaQim60k8sJt5\nhwJvBB4H9BwSza6mK+gcY7g/ya67j/rYnbSua3p585AkTZmYmGBiYmJW60gvu/uTPBo4l05AbADe\nX1X39rSBZH/g88DfVdWFzbw7gOVVNZlkMXB9VS1NsppOL+WCZrkvAWur6oZd1ll95QrQGfH8VPpv\nDxDbj237ca7d9uPdfvbbnu3h2SRUVWbSZtpTYJM8NsmfALfS6XWcUFXn9RoQjY8Bm6YConE18Ppm\n+izgqq75ZyY5sLlt6lOAG2ewLUnSXjTdMYn3AWcAFwNPr6r7Z7ryJCcDrwFuS3ILnRh9B3ABsCHJ\n2cCddM5ooqo2JdkAbAIeBs7xzCZJGp7W3U1JdtAZ9XU7O/eRQmeX0NCG5XB3k+3d3WT78Ws/nrub\nWnsSVTWjq7ElSfOPQSBJamVISJJaGRKSpFaGhCSplSEhSWplSEiSWhkSkqRWhoQkqZUhIUlqZUhI\nkloZEpKkVoaEJKmVISFJamVISJJaGRKSpFaGhCSplSEhSWplSEiSWhkSkqRWhoQkqZUhIUlqZUhI\nkloZEpKkVoaEJKmVISFJamVISJJaGRKSpFYDDYkklySZTHJr17zDk2xMsjnJNUkWdv1sTZItSe5I\ncsoga5Mk7dmgexKXAqfuMm81cG1VHQdcB6wBSHI8sApYCpwGXJQkA65PkjSNgYZEVX0d+Mkus1cC\n65vp9cDpzfQK4PKq2l5VW4EtwLJB1idJmt4wjkkcUVWTAFW1DTiimX8kcFfXcnc38yRJQzIKB65r\n2AVIknZv/yFsczLJoqqaTLIYuLeZfzfwhK7lljTzWqzrml7ePCRJUyYmJpiYmJjVOlI12C/ySY4G\nPldVT2+eXwD8uKouSHIecHhVrW4OXF8GnERnN9OXgWNrNwUmqf47IBvpHEufze8d249t+3Gu3fbj\n3X72257t53USqmpGJwQNtCeR5BN0vuI/Lsm/AGuB9wKfTnI2cCedM5qoqk1JNgCbgIeBc3YXEJKk\nuTPwnsQg2JOwvT0J249f+/HsSYzCgWtJ0ogyJCRJrQwJSVIrQ0KS1MqQkCS1MiQkSa0MCUlSK0NC\nktTKkJAktTIkJEmtDAlJUitDQpLUypCQJLUyJCRJrQwJSVIrQ0KS1MqQkCS1MiQkSa0MCUlSK0NC\nktTKkJAktTIkJEmtDAlJUitDQpLUypCQJLUyJCRJrQwJSVIrQ0KS1GokQyLJS5N8L8n/TXLesOuR\npH3VyIVEkgXAh4FTgacBr07y1OFWNd9NDLuAeWZi2AXMIxPDLmCfN3IhASwDtlTVnVX1MHA5sHLI\nNc1zE8MuYJ6ZGHYB88jEsAvY541iSBwJ3NX1/AfNPEnSHNt/2AX067DDXtFXu+3b7+XBB/dyMZI0\nT6Wqhl3DTpI8B1hXVS9tnq8Gqqou6FpmtIqWpDFRVZnJ8qMYEvsBm4EXA/cANwKvrqo7hlqYJO2D\nRm53U1X9IsnvAxvpHDO5xICQpOEYuZ6EJGl0jOLZTdPyQru9K8nWJN9JckuSG4ddzzhJckmSySS3\nds07PMnGJJuTXJNk4TBrHCctr+faJD9IcnPzeOkwaxwnSZYkuS7J7UluS/LWZv6M3qNjFRJeaDcQ\nO4DlVfWsqlo27GLGzKV03ovdVgPXVtVxwHXAmjmvanzt7vUE+EBVndA8vjTXRY2x7cDbquppwHOB\ntzSflzN6j45VSOCFdoMQxu99MBKq6uvAT3aZvRJY30yvB06f06LGWMvrCZ33qGaoqrZV1beb6fuB\nO4AlzPA9Om4fDl5ot/cV8OUkNyV507CLmQeOqKpJ6PyRAkcMuZ754PeTfDvJX7v7rj9JjgaeCXwT\nWDST9+i4hYT2vpOr6gTgd+h0R5837ILmGc8MmZ2LgCdV1TOBbcAHhlzP2EnyKOAK4NymR7Hre3La\n9+i4hcTdwBO7ni9p5qlPVXVP8+8PgSvp7NJT/yaTLAJIshi4d8j1jLWq+mE9cgrmR4FnD7OecZNk\nfzoB8bdVdVUze0bv0XELiZuApyQ5KsmBwJnA1UOuaWwlOaT5lkGSQ4FTgO8Ot6qxE3beZ3418Ppm\n+izgql0baFo7vZ7Nh9iUM/D9OVMfAzZV1YVd82b0Hh276ySaU+Au5JEL7d475JLGVpJj6PQeis6F\nlZf5evYuySeA5cDjgElgLfBZ4NPAE4A7gVVV9W/DqnGctLyeL6SzL30HsBV489T+dE0vycnA3wO3\n0fkbL+AddEax2ECP79GxCwlJ0twZt91NkqQ5ZEhIkloZEpKkVoaEJKmVISFJamVISJJaGRKaN5Ls\nSPK+rudvT/LOYdY0KEmuT3LCbuafleRDw6hJ85MhofnkIeCMJI8ddiHwy1vxDoMXP2mvMSQ0n2wH\nLgbetusPkjw+yRVJbmgez23m35rksGb6R0le20yvT/LiJMc3y9/cjET65Obn5zc3v/r7JJ9I8rZm\n/vVJPtjcwOmtzRAyX2nafjnJkma5S5Oc0VXffc2/L0jy1SSfb9Z/0Z5+6SRvaG4g803g5Nm9hNLO\nDAnNJwV8BHhNkkfv8rML6dy85iTgd4FLmvlfB05O8jTg+8Dzm/nPBf4B+G/A/2pGyj0R+EGSE4H/\nCjydzui5J+6yrQOqallVfRD4EHBpM4rpJ5rnbbVPeTbwFmApnbHKzth9k1+ObbSuqfd5wPFty0r9\n2H/YBUh7U1Xdn2Q9cC7ws64f/TawNMnU4HGPSnIInZB4AZ0xbP4SeFOS3wB+XFU/S/IN4I+SPAH4\nTFX9UzMmzlXNja8eTvK5Xcr4VNf0c+kECsDfAhf08GvcWFV3AiT5JJ0P/8+0LHsScH1V/bhZ/lPA\nsT1sQ+qJPQnNRxcCbwQO7ZoX4KTmNq3PqqonVtWDdAZAez6dD+LrgR/R6Wl8DaCqPgm8gk7gfCHJ\nC3vY/gNd023HB7bT/P01wXXgNG32dIzBO7dpYAwJzScBqKqf0Bnl8o1dP9tIp3fRWTB5RrPsD4DH\nA8dW1VY6PYv/QSc8SHJMVf2/qvoQnSGWnw78H2BFkoOaodZfPk1N/wC8upl+LU340BnRdGo31Urg\ngK42y5pjGQuAVzU1tbkB+K3m5vYHAK+cZllpxgwJzSfd37jfT2fI6al55wInJvlOku8Cb+5a9pvA\n5mb6a8Bv8MgH86ok301yC/A04ONV9S06Y/B/B/gCcCvw093UAPBW4A1Jvg28hkeC6qPAC5r1Poed\nex/fAj4M3A58v6qubPtdm9tPrmt+h68Bm3azrNQ3hwqX+pDk0Kp6IMnBdHodb5q66fws1/sC4O1V\ntWLWRUp7gQeupf5cnOR44CDgb/ZGQEijyJ6EJKmVxyQkSa0MCUlSK0NCktTKkJAktTIkJEmtDAlJ\nUqv/D6MOnIMuR98ZAAAAAElFTkSuQmCC\n",
"text/plain": [
"<matplotlib.figure.Figure at 0x7f81f67dcc10>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"First post!\n",
"I was wondering if anyone out there could enlighten me on this car I saw\n",
"the other day. It was a 2-door sports car, looked to be from the late 60s/\n",
"early 70s. It was called a Bricklin. The doors were really small. In addition,\n",
"the front bumper was separate from the rest of the body. This is \n",
"all I know. If anyone can tellme a model name, engine specs, years\n",
"of production, where this car is made, history, or whatever info you\n",
"have on this funky looking car, please e-mail.\n"
]
}
],
"source": [
"%matplotlib inline\n",
"from sklearn.datasets import fetch_20newsgroups\n",
"\n",
"data = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))\n",
"\n",
"post_texts = data.data\n",
"news_group_ids = data.target\n",
"\n",
"print data.description\n",
"\n",
"print \"Number of posts\", len(data.data)\n",
"import matplotlib.pyplot as plt\n",
"plt.hist(data.target, bins=20)\n",
"plt.xlabel('Newsgroup Id')\n",
"plt.ylabel('Number of Posts')\n",
"plt.show()\n",
"\n",
"print \"First post!\"\n",
"print data.data[0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, you will be writing a function to compute the term frequency part of [Tf-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf). It's up to you how fancy to make this function. In my simple version, I used split after first removing leading or trailing punctuation (I used the `strip` function) and also converting the words to lower case."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def createLowerWordList(line):\n",
" \"\"\"\n",
" Given a line of string, seperates \n",
" the string into lower case word and\n",
" get rid of punctuations and numbers\n",
" \"\"\"\n",
" # get a splited words list and an empty list\n",
" wordList1 = line.split()\n",
" wordList2 =[]\n",
" # loop through the word list to get rid of punctuations and convert words to lower case\n",
" for word in wordList1:\n",
" cleanWord = \"\"\n",
" \n",
" if '-' in word:\n",
" cleanWord = word\n",
" else:\n",
" for char in word:\n",
" if char in '!,.?\":;0123456789':\n",
" char = \"\"\n",
" cleanWord += char\n",
"\n",
" cleanWord = cleanWord.lower()\n",
" if cleanWord != \"\":\n",
" wordList2.append(cleanWord)\n",
" return wordList2\n",
"\n",
"def tf(text):\n",
" \"\"\" Returns a dictionary where keys are words that occur in text\n",
" and the value is the number of times that each word occurs. \"\"\"\n",
" word_dict = {}\n",
" word_list = createLowerWordList(text)\n",
" \n",
" for word in word_list:\n",
" if word in word_dict:\n",
" word_dict[word] += 1\n",
" else:\n",
" word_dict[word] = 1\n",
" \n",
" return word_dict\n",
"\n",
"def quick_check(text):\n",
" \"\"\" Returns a dictionary where keys are words that occur in text\n",
" and the value is a 1 for every word. \"\"\"\n",
" word_dict = {}\n",
" word_list = createLowerWordList(text)\n",
" \n",
" for word in word_list:\n",
" if word not in word_dict:\n",
" word_dict[word] = 1\n",
" \n",
" return word_dict"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, you will be writing a function to compute the inverse document frequency part of [Tf-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Lowest IDF (most common)\n",
"(u'the', 0.1793914362572215)\n",
"(u'to', 0.2931770787451345)\n",
"(u'a', 0.3089423640640295)\n",
"(u'and', 0.3853401835575585)\n",
"(u'of', 0.39029004386586114)\n",
"(u'i', 0.4582294839125497)\n",
"(u'in', 0.47457478229496936)\n",
"(u'is', 0.4801307478656507)\n",
"(u'that', 0.562891431606236)\n",
"(u'it', 0.5899456138728572)\n",
"\n",
"Highest IDF (least common)\n",
"(u'jawbone', 9.333796175903101)\n",
"(u\"palestine'\", 9.333796175903101)\n",
"(u'mpryz@(+s>ux=(ill(qi^\\\\wpd]iaw*\\\\', 9.333796175903101)\n",
"(u'&\\\\&ob\\\\<^r=\\\\t++](@py/vp', 9.333796175903101)\n",
"(u'fusion-powered', 9.333796175903101)\n",
"(u'92-cover', 9.333796175903101)\n",
"(u'\"push-in\"', 9.333796175903101)\n",
"(u'/msdos/dskutl/seagatezip', 9.333796175903101)\n",
"(u'mages/sorcerors', 9.333796175903101)\n",
"(u'codings', 9.333796175903101)\n"
]
}
],
"source": [
"from math import log\n",
"from collections import defaultdict\n",
"import operator\n",
"\n",
"def merge_dict(dicts, defaultdict=defaultdict):\n",
" \"\"\"Returns a dictionary where keys are the unique keys appeared in the provided\n",
" dictionary list, and the values are sum of all the value in all dictionaries\n",
" for specific key\"\"\"\n",
" merged = defaultdict(int)\n",
" for d in dicts:\n",
" for k in d:\n",
" merged[k] += d[k]\n",
" return merged\n",
"\n",
"def idf(data):\n",
" \"\"\" Returns a dictionary where the keys are words and the values are inverse\n",
" document frequencies. For this function you should use the formula\n",
" idf(w, data) = log(N / |text in data that contain the word w|) \"\"\"\n",
" \n",
" idf_dicts = []\n",
" N = float(len(data))\n",
" \n",
" for doc in data:\n",
" idf_dicts.append(quick_check(doc))\n",
"\n",
" idf_dict = merge_dict(idf_dicts)\n",
" \n",
" for key in idf_dict:\n",
" idf_dict[key] = log(N/idf_dict[key])\n",
" \n",
" return idf_dict\n",
"\n",
"idf = idf(data.data)\n",
"sorted_idf = sorted(idf.items(), key=operator.itemgetter(1))\n",
"\n",
"print \"Lowest IDF (most common)\"\n",
"for d in sorted_idf[0:10]:\n",
" print d\n",
"\n",
"print \"\"\n",
"print \"Highest IDF (least common)\"\n",
"rev_sorted_idf = sorted(idf.items(), key=operator.itemgetter(1))\n",
"for d in reversed(rev_sorted_idf[-10:]):\n",
" print d"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The last step in Tf-IDF is to compute the product of tf and IDF for each document, and then convert the resultant dictionary of Tf-IDF features into a vector. We'll be discussing this next class."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Clarifying Questions\n",
"\n",
"Use this space to ask questions regarding the content covered in the reading. These questions should be restricted to helping you better understand the material. For questions that push beyond what is in the reading, use the next answer field. If you don't have a fully formed question, but are generally having a difficult time with a topic, you can indicate that here as well."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Enrichment Questions\n",
"\n",
"Use this space to ask any questions that go beyond (but are related to) the material presented in this reading. Perhaps there is a particular topic you'd like to see covered in more depth. Perhaps you'd like to know how to use a library in a way that wasn't show in the reading. One way to think about this is what additional topics would you want covered in the next class (or addressed in a followup e-mail to the class). I'm a little fuzzy on what stuff will likely go here, so we'll see how things evolve."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Additional Resources / Explorations\n",
"\n",
"If you found any useful resources, or tried some useful exercises that you'd like to report please do so here. Let us know what you did, what you learned, and how others can replicate it."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.11"
}
},
"nbformat": 4,
"nbformat_minor": 0
}