
In [1]:
import nltk
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

In [2]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [3]:
paragraph = """Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from data in various forms, both structured and unstructured, similar to data mining. Data science is a "concept to unify statistics, data analysis, machine learning and their related methods" in order to "understand and analyze actual phenomena" with data. It employs techniques and theories drawn from many fields within the context of mathematics, statistics, information science, and computer science. Turing award winner Jim Gray imagined data science as a "fourth paradigm" of science (empirical, theoretical, computational and now data-driven) and asserted that "everything about science is changing because of the impact of information technology" and the data deluge. Data Science is now often used interchangeably with earlier concepts like business analytics, business intelligence, predictive modeling, and statistics. In many cases, earlier approaches and solutions are now simply rebranded as "data science" to be more attractive, which can cause the term to become "dilute beyond usefulness."While many university programs now offer a data science degree, there exists no consensus on a definition or suitable curriculum contents.To its discredit, however, many data-science and big-data projects fail to deliver useful results, often as a result of poor management and utilization of resources."""

In [4]:
sentences = nltk.sent_tokenize(paragraph)  # Split the paragraph into sentences.
lemmatizer = WordNetLemmatizer()  # PorterStemmer() is the alternative if you want stemming instead of lemmatization.
corpus = []

In [5]:
# Clean each sentence, drop stop words, and lemmatize what remains
stop_words = set(stopwords.words('english'))  # Build the set once instead of once per word
for sentence in sentences:
    review = re.sub('[^a-zA-Z]', ' ', sentence)  # Replace every non-letter character with a space
    review = review.lower()    # Convert the sentence to lower case
    review = review.split()    # Split the sentence into words
    review = [lemmatizer.lemmatize(word) for word in review if word not in stop_words]  # Remove stop words and lemmatize the rest
    review = ' '.join(review)
    corpus.append(review)  # Append the cleaned sentence to the corpus
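
To make the effect of each cleaning step visible, here is a minimal sketch on a single sentence. It assumes a tiny hand-written stop-word set (`demo_stop_words`, a hypothetical stand-in for NLTK's English list) and skips lemmatization, so it runs without any downloads.

```python
import re

# Hypothetical stand-in for stopwords.words('english'); the real list is much longer.
demo_stop_words = {"and", "now", "the", "of", "a", "as"}

sentence = 'Jim Gray imagined data science as a "fourth paradigm" (empirical and now data-driven).'

letters_only = re.sub('[^a-zA-Z]', ' ', sentence)  # punctuation, digits and hyphens become spaces
tokens = letters_only.lower().split()              # lower-case, then split on whitespace
kept = [t for t in tokens if t not in demo_stop_words]

print(kept)
```

Note that the hyphen in "data-driven" is replaced by a space, so it ends up as two separate tokens, which matches the "data driven" seen in the cleaned corpus.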

In [6]:
sentences

['Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from data in various forms, both structured and unstructured, similar to data mining.',
 'Data science is a "concept to unify statistics, data analysis, machine learning and their related methods" in order to "understand and analyze actual phenomena" with data.',
 'It employs techniques and theories drawn from many fields within the context of mathematics, statistics, information science, and computer science.',
 'Turing award winner Jim Gray imagined data science as a "fourth paradigm" of science (empirical, theoretical, computational and now data-driven) and asserted that "everything about science is changing because of the impact of information technology" and the data deluge.',
 'Data Science is now often used interchangeably with earlier concepts like business analytics, business intelligence, predictive modeling, and statistics.',
 'In man

In [7]:
corpus

['data science interdisciplinary field us scientific method process algorithm system extract knowledge insight data various form structured unstructured similar data mining',
 'data science concept unify statistic data analysis machine learning related method order understand analyze actual phenomenon data',
 'employ technique theory drawn many field within context mathematics statistic information science computer science',
 'turing award winner jim gray imagined data science fourth paradigm science empirical theoretical computational data driven asserted everything science changing impact information technology data deluge',
 'data science often used interchangeably earlier concept like business analytics business intelligence predictive modeling statistic',
 'many case earlier approach solution simply rebranded data science attractive cause term become dilute beyond usefulness',
 'many university program offer data science degree exists consensus definition suitable curriculum conte

In [8]:
# Creating the TF-IDF model
from sklearn.feature_extraction.text import TfidfVectorizer
cv = TfidfVectorizer()
X = cv.fit_transform(corpus).toarray()
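
For intuition about what the vectorizer computes, here is a rough pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings: raw term counts, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The helper `tfidf_rows` and the toy documents are illustrative assumptions, and the sketch splits on whitespace only, whereas the real vectorizer has its own tokenizer (which, for example, drops single-character tokens).

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows): raw counts * smoothed IDF, each row L2-normalized."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0  # guard against empty documents
        rows.append([v / norm for v in raw])
    return vocab, rows

toy_docs = ["data science data", "data mining", "science method"]  # illustrative only
vocab, rows = tfidf_rows(toy_docs)
for term, weight in zip(vocab, rows[0]):
    print(f"{term:8s} {weight:.4f}")
```

Because of the row normalization, the weights in one row are only comparable within that row; the absolute values shift with sentence length.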

In [9]:
X  # TF-IDF matrix: one row per sentence, one column per vocabulary term

array([[0.        , 0.23196631, 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.33056411, 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.23196631, 0.        , 0.19255196,
        0.23196631, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.23196631, 0.        , 0.        ,
        0.23196631, 0.        , 0.23196631, 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.19255196,
        0.23196631, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.23196631,
        0.        , 0.        , 0.        , 0.  

In [None]:
# Bag of Words represents each document (here, each sentence) as a vector of raw word counts. TF-IDF additionally scales each count by how rare the word is across the corpus, so words that appear in most documents get less weight than distinctive ones.
# Bag of Words vectors are easier to interpret. However, TF-IDF features often perform better in machine learning models.
