Text-Prediction-App-NLP

Overview

This project builds a product that predicts the next word as the user types. In this capstone we apply data science in the area of natural language processing; the project draws on NLP, text mining, and the associated tools in R. Predictions are based on the words present in the corpus on which the model is trained. A maximum likelihood estimator with Kneser-Ney smoothing is used as the prediction model. This is a simplified version of Kneser-Ney smoothing, based on certain assumptions about the discount value. Here are some resources:

Natural language processing
Text mining infrastructure in R
CRAN Task View: Natural Language Processing
Language Modelling by Dan Jurafsky
Kneser-Ney Smoothing

Dataset

This training data will be the basis for most of the capstone. To get started, you must download the data from the Coursera site and not from external websites.

Capstone Dataset

Prediction Model

Kneser-Ney smoothing is an algorithm that adjusts n-gram weights (through discounting) using the continuation counts of lower-order n-grams. The concept is fairly simple, though a bit more difficult to implement in a program.
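
For reference, a standard interpolated bigram form of the Kneser-Ney equation (as presented in Jurafsky's lecture; the exact variant used in this project may differ) is:

$$
P_{KN}(w_i \mid w_{i-1}) \;=\; \frac{\max\bigl(c(w_{i-1} w_i) - d,\ 0\bigr)}{c(w_{i-1})} \;+\; \lambda(w_{i-1})\, P_{\mathrm{continuation}}(w_i)
$$

with

$$
\lambda(w_{i-1}) = \frac{d}{c(w_{i-1})}\,\bigl|\{ w : c(w_{i-1} w) > 0 \}\bigr|,
\qquad
P_{\mathrm{continuation}}(w_i) = \frac{\bigl|\{ w' : c(w' w_i) > 0 \}\bigr|}{\bigl|\{ (w', w) : c(w' w) > 0 \}\bigr|}
$$

where $c(\cdot)$ counts occurrences in the training data and $d$ is the discount.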

Given the sentence below, “Francisco” is presented as the suggested ending by a raw frequency model, because it appears more often than “glasses” in most text.

I can’t see without my reading … “Francisco”

However, even though “Francisco” appears more often than “glasses”, “Francisco” rarely occurs outside of the context of “San Francisco”. Thus, instead of observing how often a word appears, the Kneser-Ney algorithm takes into account how often a word completes a bigram type (e.g., “prescription glasses”, “reading glasses”, “small glasses” vs. “San Francisco”). This example is taken from Jurafsky’s video lecture on Kneser-Ney smoothing, which also describes the equation used to calculate the Kneser-Ney probability: https://www.youtube.com/watch?v=wtB00EczoCM
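
As a minimal sketch of the continuation-count idea, here it is computed in R on a hypothetical toy bigram table (the data and column names below are illustrative assumptions, not the capstone data):

```r
library(dplyr)

# Hypothetical toy bigram table (illustrative only, not the capstone corpus).
bigrams <- tibble::tribble(
  ~w1,            ~w2,
  "San",          "Francisco",
  "San",          "Francisco",
  "San",          "Francisco",
  "San",          "Francisco",
  "reading",      "glasses",
  "prescription", "glasses",
  "small",        "glasses"
)

# Continuation count: the number of distinct bigram types a word completes,
# i.e. how many different words precede it.
continuation <- bigrams %>%
  distinct(w1, w2) %>%
  count(w2, name = "continuation_count")

continuation
# "Francisco" has the higher raw count (4 vs. 3) but completes only 1 bigram
# type ("San Francisco"), while "glasses" completes 3, so Kneser-Ney
# prefers "glasses".
```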

Typically, the smoothing algorithm is applied to all of the n-gram models (unigram, bigram, etc.) before any predictions are attempted. However, given the constraint of computing time, I only had enough time to create a version that performs the smoothing when a user provides an input.
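
To sketch what that query-time smoothing can look like, the function below computes the interpolated Kneser-Ney bigram probability from precomputed count tables on demand. The function name, the named-vector count tables, and the fixed discount d = 0.75 are all illustrative assumptions, not taken from this project's code:

```r
# A sketch of query-time Kneser-Ney for a bigram model. The count tables
# (named numeric vectors) and the fixed discount d = 0.75 are assumptions.
kn_bigram_prob <- function(prev, word, bi_counts, uni_counts, d = 0.75) {
  c_bigram <- bi_counts[paste(prev, word)]
  c_bigram <- ifelse(is.na(c_bigram), 0, c_bigram)
  c_prev   <- uni_counts[prev]

  # How many distinct words were observed after `prev`.
  types_after_prev <- sum(startsWith(names(bi_counts), paste0(prev, " ")))

  # Continuation probability: share of bigram types that `word` completes.
  p_cont <- sum(endsWith(names(bi_counts), paste0(" ", word))) / length(bi_counts)

  # Interpolated Kneser-Ney with absolute discounting.
  lambda <- (d / c_prev) * types_after_prev
  unname(max(c_bigram - d, 0) / c_prev + lambda * p_cont)
}

# Toy counts matching the example above.
uni_counts <- c(San = 4, reading = 1, prescription = 1, small = 1)
bi_counts  <- c("San Francisco" = 4, "reading glasses" = 1,
                "prescription glasses" = 1, "small glasses" = 1)

kn_bigram_prob("reading", "glasses", bi_counts, uni_counts)    # ~0.81
kn_bigram_prob("reading", "Francisco", bi_counts, uni_counts)  # ~0.19
```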

Current State of the Project

I have conducted the exploratory analysis for this project. Here is the link to the analysis report.

An early build of the application is ready. You can access it by clicking on the following link.

Text Prediction App