# Feature Extraction (Vectorization)
1. [Approach1: Bag of Words](#Approach1:-Bag-of-Words)
2. [Approach2: Term Frequency - Inverse Document Frequency (TF-IDF)](#Approach2:-Term-Frequency---Inverse-Document-Frequency-(TF-IDF))

In [31]:
sample_documents = ["lifestyle governs mobile choice faster better funkier hardware alone going help phone firms sell handsets research suggests instead phone firms keen get customers pushing technology sake consumers far interested handsets fit lifestyle screen size onboard memory chip inside shows depth study handset maker ericsson historically industry much focus using technology said dr michael bjorn senior advisor mobile media ericsson consumer enterprise lab stop saying technologies change lives said try speak consumers language help see fits told bbc news website study ericsson interviewed 14 000 mobile phone owners ways use phone people habits remain said dr bjorn move activity mobile phone much convenient way one good example diary writing among younger people said diaries always popular mobile phone especially one equipped camera helps keep different form youngsters use text messages also reflects desire chat keep contact friends lets slightly changed way dr bjorn said although consumers always use phone sheer variety new handset technologies make possible gradually drive new habits lifestyles ericsson research shown consumers divide different tribes use phones different ways dr bjorn said groups dubbed pioneers materialists interested trying new things behind start many trends phone use instance said older people using sms much five years ago younger users often children ageing mobile owners encouraged older people try could keep touch another factor governing speed change mobile phone use simple speed new devices bought pioneers materialists 25 people handsets new innovations cameras consumers stop worrying send picture message person end able see significant number users passed use new innovations tends take dr bjorn said early reports camera phone usage japan seemed imply innovation going flop however said 45 japanese people ericsson questioned use camera phone least month 2003 figure 29 similarly across europe numbers people taking snaps cameras starting rise 2003 4 people uk took phonecam snap least month figure 14 similar rises seen many european nations dr bjorn said people also used camera phones different ways film even digital cameras usage patterns digital cameras almost exactly replacing usage patterns analogue cameras said digital cameras tend used significant events weddings holidays birthdays contrast said camera phones used much capture moment woven everyday life",
                   "french honour director parker british film director sir alan parker made officer order arts letters one france highest cultural honours sir alan received decoration paris wednesday french culture minister renaud donnedieu de vabres explored possibilities film immense talent mr de vabres said presented award parker praised french films saying hollywood created modern cinema uses commodity told minister honoured thus distinguished france flag carrier cinema throughout world sir alan films include oscar winning fame plus midnight express commitments founding member director guild great britain former chairman uk film council board british film institute work campaigns shown us artist occupies essential place contemporary society mr de vabres said dreams show us links weave question world mirror work also cited director 2003 film life david gale kevin spacey played man death row proof veritable artistic commitment death sentence", 
                   "fockers fuel festive film chart comedy meet fockers topped festive box office north america setting new record christmas day sequel took 44 7m 23 2m 24 26 december according studio estimates took 19 1m 9 9m christmas day alone highest takings day box office history meet fockers sequel ben stiller comedy meet parents also starring robert de niro blythe danner dustin hoffman barbra streisand despite success meet fockers takings 26 5 2003 figures blamed christmas falling weekend year christmas falls weekend bad business said paul dergarabedian president exhibitor relations compiles box office statistics weekend top 12 films took estimated 121 9m 63 3m compared 165 8m 86 1m last year third lord rings film dominated box office meet fockers knocked last week top film lemony snicket series unfortunate events third place 12 5m 6 5m comedy fat albert co written bill cosby entered chart second place opening christmas day taking 12 7m 6 6m aviator starring leonardo dicaprio howard hughes took 9 4m expanding 40 1 796 cinemas christmas day"]
for i, txt in enumerate(sample_documents):
    print("\narticle:{}\t{}".format(i+1, txt))




article:1	lifestyle governs mobile choice faster better funkier hardware alone going help phone firms sell handsets research suggests instead phone firms keen get customers pushing technology sake consumers far interested handsets fit lifestyle screen size onboard memory chip inside shows depth study handset maker ericsson historically industry much focus using technology said dr michael bjorn senior advisor mobile media ericsson consumer enterprise lab stop saying technologies change lives said try speak consumers language help see fits told bbc news website study ericsson interviewed 14 000 mobile phone owners ways use phone people habits remain said dr bjorn move activity mobile phone much convenient way one good example diary writing among younger people said diaries always popular mobile phone especially one equipped camera helps keep different form youngsters use text messages also reflects desire chat keep contact friends lets slightly changed way dr bjorn said although consume

## Approach1: Bag of Words
### scikit-learn library provides CountVectorizer class to perform this vectorization process. Firstly, Identify Unique words...

In [34]:
from sklearn.feature_extraction.text import CountVectorizer

# Identify Unique words
vectorizer = CountVectorizer()
vectorizer.fit(sample_documents)
print(f'vector vocabulary - {vectorizer.vocabulary_}\n')


vector vocabulary - {'lifestyle': 236, 'governs': 186, 'mobile': 257, 'choice': 79, 'faster': 158, 'better': 55, 'funkier': 180, 'hardware': 194, 'alone': 37, 'going': 183, 'help': 195, 'phone': 290, 'firms': 165, 'sell': 330, 'handsets': 193, 'research': 316, 'suggests': 366, 'instead': 217, 'keen': 223, 'get': 182, 'customers': 103, 'pushing': 305, 'technology': 372, 'sake': 323, 'consumers': 92, 'far': 157, 'interested': 219, 'fit': 166, 'screen': 325, 'size': 346, 'onboard': 276, 'memory': 250, 'chip': 78, 'inside': 215, 'shows': 340, 'depth': 111, 'study': 364, 'handset': 192, 'maker': 243, 'ericsson': 137, 'historically': 198, 'industry': 212, 'much': 263, 'focus': 172, 'using': 397, 'said': 322, 'dr': 126, 'michael': 253, 'bjorn': 58, 'senior': 332, 'advisor': 31, 'media': 247, 'consumer': 91, 'enterprise': 135, 'lab': 227, 'stop': 361, 'saying': 324, 'technologies': 371, 'change': 73, 'lives': 239, 'try': 387, 'speak': 354, 'language': 228, 'see': 327, 'fits': 167, 'told': 380,

### Perform Vectorization: find the frequency count of each unique word, and zero otherwise. Each article becomes a vector of fixed size.


In [39]:
# encode document
vector = vectorizer.transform(sample_documents)
# summarize encoded vector
print(f'vector shape: {vector.shape}\n')
print(f'article vector\n {vector.toarray()}')


vector shape: (3, 420)

article vector
 [[1 0 0 ... 1 2 1]
 [0 0 0 ... 0 0 0]
 [0 3 1 ... 0 0 0]]


## Approach2: Term Frequency - Inverse Document Frequency (TF-IDF)
- Term Frequency: Simply finds out the frequency of a word in document.
- Inverse Document Frequency: Assigns a lower weight to the words which appear most frequently. It basically depicts the rarity of the word in all documents.
### scikit-learn library provides TfidfVectorizer class to perform this vectorization process. Firstly, Identify Unique words...

In [42]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
vectorizer.fit(sample_documents)
# summarize
print(f'vector vocabulary - {vectorizer.vocabulary_}\n')

vector vocabulary - {'lifestyle': 236, 'governs': 186, 'mobile': 257, 'choice': 79, 'faster': 158, 'better': 55, 'funkier': 180, 'hardware': 194, 'alone': 37, 'going': 183, 'help': 195, 'phone': 290, 'firms': 165, 'sell': 330, 'handsets': 193, 'research': 316, 'suggests': 366, 'instead': 217, 'keen': 223, 'get': 182, 'customers': 103, 'pushing': 305, 'technology': 372, 'sake': 323, 'consumers': 92, 'far': 157, 'interested': 219, 'fit': 166, 'screen': 325, 'size': 346, 'onboard': 276, 'memory': 250, 'chip': 78, 'inside': 215, 'shows': 340, 'depth': 111, 'study': 364, 'handset': 192, 'maker': 243, 'ericsson': 137, 'historically': 198, 'industry': 212, 'much': 263, 'focus': 172, 'using': 397, 'said': 322, 'dr': 126, 'michael': 253, 'bjorn': 58, 'senior': 332, 'advisor': 31, 'media': 247, 'consumer': 91, 'enterprise': 135, 'lab': 227, 'stop': 361, 'saying': 324, 'technologies': 371, 'change': 73, 'lives': 239, 'try': 387, 'speak': 354, 'language': 228, 'see': 327, 'fits': 167, 'told': 380,

### Perform Vectorization: find the TF-IDF of each unique word, and zero otherwise. Each article becomes a vector of fixed size.

In [44]:
# encode document
vector = vectorizer.transform(sample_documents)
# summarize encoded vector
print(f'vector shape: {vector.shape}\n')
print(f'article vector\n {vector.toarray()}')

vector shape: (3, 420)

article vector
 [[0.03222937 0.         0.         ... 0.03222937 0.06445873 0.03222937]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.16646059 0.05548686 ... 0.         0.         0.        ]]


In [3]:
import warnings
warnings.filterwarnings("ignore")
from nltk.corpus import stopwords
import nltk
import re

#loading_the_stop_words_from_nltk_library_
stop_words = set(stopwords.words('english'))

def txt_preprocessing(total_text, index, column, df):
    if type(total_text) is not int:
        string = ""
        
        #replace_every_special_char_with_space
        total_text = re.sub('[^a-zA-Z0-9\n]', ' ', total_text)
        
        #replace_multiple_spaces_with_single_space
        total_text = re.sub('\s+',' ', total_text)
        
        #converting_all_the_chars_into_lower_case
        total_text = total_text.lower()
        
        for word in total_text.split():
        #if_the_word_is_a_not_a_stop_word_then_retain_that_word_from_the_data
            if not word in stop_words:
                string += word + " "
        
        df[column][index] = string        

In [20]:
for index, row in train_data.iterrows():
    if type(row['Text']) is str:
        txt_preprocessing(row['Text'], index, 'Text', train_data)
        print(row['Text'])
    else:
        print("THERE IS NO TEXT DESCRIPTION FOR ID :",index)

train_data.head()

NameError: name 'txt_preprocessing' is not defined

In [9]:
train_data[train_data['Category'].isin(['entertainment','tech'])].to_csv('./learn-ai-bbc/train.csv',index=False)


In [26]:
for index, row in result.iterrows():
    if type(row['Text']) is str:
        txt_preprocessing(row['Text'], index, 'Text', result)
    else:
        print("THERE IS NO TEXT DESCRIPTION FOR ID :",index)

result.head()

Unnamed: 0,ArticleId,Text,ArticleId.1,Category
0,1018,qpr keeper day heads for preston queens park r...,1018,sport
1,1319,software watching while you work software that...,1319,tech
2,1138,d arcy injury adds to ireland woe gordon d arc...,1138,business
3,459,india s reliance family feud heats up the ongo...,459,entertainment
4,1020,boro suffer morrison injury blow middlesbrough...,1020,politics


In [30]:
header = ['ArticleId','Text','Category']
subset = result[result['Category'].isin(['entertainment','tech'])]
s = pd.Series(subset.unique())
s.to_csv('./learn-ai-bbc/test.csv', columns = header, index=False)




AttributeError: 'DataFrame' object has no attribute 'unique'