# Python Classification Feature Extraction for Timbl

**(C) 2017-2024 by [Damir Cavar](http://cavar.me/damir/)**

**License:** [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/) ([CA BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/))

**Prerequisites:**

In [None]:
!pip install -U nltk

This is a tutorial related to the discussion of feature extraction for classification and clustering in the textbook [Machine Learning: The Art and Science of Algorithms that Make Sense of Data](https://www.cs.bris.ac.uk/~flach/mlbook/) by [Peter Flach](https://www.cs.bris.ac.uk/~flach/).

This tutorial was developed as part of my course material for the course Machine Learning for Computational Linguistics in the [Computational Linguistics Program](http://cl.indiana.edu/) of the [Department of Linguistics](http://www.indiana.edu/~lingdept/) at [Indiana University](https://www.indiana.edu/).

## Feature Extraction

In [1]:
from nltk import word_tokenize

In [25]:
text1 = """The city will pay for it by taxes on properties selling for more
than $5 million.

The real estate transfer tax, as it's called, was increased last year for both
residential and commercial properties. The hike was approved by voters in
November.
Powered by SmartAsset.com
SmartAsset.com

The tax starts at 2.25% and goes up to 3% for properties worth at least $25
million. It's expected to bring in an average of $45 million a year, according
to the city controller. But the money goes into the city's general fund and is
also expected to be used for affordable housing and senior support services.

The free tuition plan is expected to impact about 28,000 residents who currently
take classes at City College of San Francisco and encourage more people to sign
up. Chancellor Susan Lamb said the school has the capacity for 85,000 students.

It's difficult to predict how many more people will enroll, and how much the
free-tuition plan will end up costing. San Francisco has committed $5.4 million
a year for the next two years, and then will have to reassess. That includes a
one-time $500,000 stipend to City College to help handle an influx of students.

Related: Why New York's 'tuition-free colleges' still cost $14K

San Francisco's tuition-free plan is more progressive than others round the
country. First, everyone is eligible as long as they have resided in San Francisco
for at least one year.

It covers the $46 cost per credit no matter how rich you are, "even to the
children of the founders of Facebook," said city lawmaker Jane Kim.

You don't have to be enrolled full-time or be a recent high school graduate.
This means that people who are seeking job retraining or want to take a few
foreign language courses won't have to pay for the cost of the credits.

Related: Rhode Island governor wants to make college free, too

Students will still be on the hook for the mandatory $17 per semester fee at
City College and the cost of books, so college won't necessarily be free.

What also sets apart San Francisco's plan is that it offers the poorest students
additional money to help pay for these other expenses. An individual has to earn
less than $17,000 a year to qualify for the aid, or less than $37,000 for a
family of four. Eligible full-time students will get $500 a year and part-time
students will get $200 a year.

"We have the fastest growing income gap than any city across the nation," Kim said
on Monday at a press conference.

"Making city college free is going to provide greater opportunities for more San
Franciscans to enter into the middle class and more San Franciscans to stay in the
middle class if they currently are," she said.

The push for free tuition is gaining support across the country. Tennessee started
offering free community college to residents in 2015, and will expand the program
this year to include adults returning to school. Lawmakers in New York are
discussing a program that would make four-year and two-year public colleges
tuition-free for residents who earn less than $125,000 a year. And Rhode Island's
governor is pushing for two free years at public colleges for recent high school
graduates."""

In [66]:
tokens1 = word_tokenize(text1.lower())

In [67]:
from collections import Counter

In [68]:
fp = Counter(tokens1)

print(fp)

Counter({'.': 26, 'the': 24, 'to': 24, 'for': 18, ',': 14, '$': 13, 'and': 12, 'a': 12, 'year': 9, 'of': 8, 'is': 8, "'s": 8, 'will': 8, 'San': 7, 'free': 7, 'at': 7, 'The': 6, 'than': 6, 'in': 6, 'city': 6, 'more': 6, 'have': 5, 'be': 5, 'students': 5, 'Francisco': 5, 'cost': 4, 'said': 4, 'plan': 4, 'school': 4, 'college': 4, 'million': 4, 'are': 4, 'It': 3, 'on': 3, 'or': 3, 'up': 3, "n't": 3, 'colleges': 3, 'residents': 3, 'less': 3, 'people': 3, 'by': 3, 'pay': 3, 'that': 3, 'how': 3, 'City': 3, 'expected': 3, 'has': 3, 'properties': 3, 'who': 3, 'it': 3, 'College': 3, '``': 3, "''": 3, 'as': 3, 'across': 2, 'middle': 2, 'earn': 2, 'goes': 2, 'they': 2, 'Related': 2, 'tuition-free': 2, 'full-time': 2, '%': 2, 'support': 2, 'an': 2, 'years': 2, 'take': 2, 'York': 2, 'was': 2, 'public': 2, ':': 2, 'class': 2, 'Rhode': 2, 'Kim': 2, 'money': 2, 'two': 2, 'least': 2, 'also': 2, 'New': 2, 'program': 2, 'currently': 2, 'per': 2, 'governor': 2, 'Island': 2, 'help': 2, 'into': 2, 'tax': 2,

In [69]:
model = [ (i, fp[i], len(i)) for i in fp ]

print(model)

[('across', 2, 6), ('progressive', 1, 11), ('have', 5, 4), ('middle', 2, 6), ('wants', 1, 5), ('earn', 2, 4), ('last', 1, 4), ('It', 3, 2), ('for', 18, 3), ('fund', 1, 4), ('of', 8, 2), ('on', 3, 2), ('We', 1, 2), ('she', 1, 3), ('goes', 2, 4), ('they', 2, 4), ('5', 1, 1), ('called', 1, 6), ('cost', 4, 4), ('affordable', 1, 10), ('Powered', 1, 7), ('.', 26, 1), ('or', 3, 2), ('transfer', 1, 8), ('foreign', 1, 7), ('up', 3, 2), ('enter', 1, 5), ('said', 4, 4), ('Related', 2, 7), ("n't", 3, 3), ('125,000', 1, 7), ('colleges', 3, 8), ('residents', 3, 9), ('and', 12, 3), ('tuition-free', 2, 12), ('full-time', 2, 9), ('community', 1, 9), ('pushing', 1, 7), ('approved', 1, 8), ('%', 2, 1), ('support', 2, 7), ('gap', 1, 3), ('going', 1, 5), ('even', 1, 4), ('income', 1, 6), ('an', 2, 2), ('years', 2, 5), ('17', 1, 2), ('nation', 1, 6), ('include', 1, 7), ('gaining', 1, 7), ('apart', 1, 5), ('part-time', 1, 9), ('capacity', 1, 8), ('less', 3, 4), ('matter', 1, 6), ('people', 3, 6), ('used', 1,

In [70]:
for x in model:
    print( "\t".join( (str(x[1]), str(x[2]), x[0]) ) )


2	6	across
1	11	progressive
5	4	have
2	6	middle
1	5	wants
2	4	earn
1	4	last
3	2	It
18	3	for
1	4	fund
8	2	of
3	2	on
1	2	We
1	3	she
2	4	goes
2	4	they
1	1	5
1	6	called
4	4	cost
1	10	affordable
1	7	Powered
26	1	.
3	2	or
1	8	transfer
1	7	foreign
3	2	up
1	5	enter
4	4	said
2	7	Related
3	3	n't
1	7	125,000
3	8	colleges
3	9	residents
12	3	and
2	12	tuition-free
2	9	full-time
1	9	community
1	7	pushing
1	8	approved
2	1	%
2	7	support
1	3	gap
1	5	going
1	4	even
1	6	income
2	2	an
2	5	years
1	2	17
1	6	nation
1	7	include
1	7	gaining
1	5	apart
1	9	part-time
1	8	capacity
3	4	less
1	6	matter
3	6	people
1	4	used
1	6	28,000
1	10	individual
1	4	sign
1	7	500,000
1	7	fastest
1	2	46
3	2	by
1	7	resided
1	7	classes
1	10	commercial
1	5	First
1	10	Chancellor
2	4	take
1	5	worth
1	6	starts
2	4	York
1	5	round
1	6	influx
1	7	credits
2	3	was
1	3	You
1	3	too
1	6	voters
1	6	estate
1	4	What
2	6	public
1	6	Making
1	6	handle
1	13	'tuition-free
3	3	pay
6	3	The
2	1	:
2	5	class
3	4	that
3	3	how
24	3	the
1	6	37,000
1	4	hike
2	5	R

In [71]:
from nltk.corpus import stopwords

In [72]:
stopw = stopwords.words("english")
stopw.append("us")

In [73]:
def isStopword(word):
    if word in stopw:
        return(1)
    return(0)

for x in model:
    print( "\t".join( (str(x[1]), str(x[2]), x[0], str(isStopword(x[0]))) ) )


2	6	across	0
1	11	progressive	0
5	4	have	1
2	6	middle	0
1	5	wants	0
2	4	earn	0
1	4	last	0
3	2	It	0
18	3	for	1
1	4	fund	0
8	2	of	1
3	2	on	1
1	2	We	0
1	3	she	1
2	4	goes	0
2	4	they	1
1	1	5	0
1	6	called	0
4	4	cost	0
1	10	affordable	0
1	7	Powered	0
26	1	.	0
3	2	or	1
1	8	transfer	0
1	7	foreign	0
3	2	up	1
1	5	enter	0
4	4	said	0
2	7	Related	0
3	3	n't	0
1	7	125,000	0
3	8	colleges	0
3	9	residents	0
12	3	and	1
2	12	tuition-free	0
2	9	full-time	0
1	9	community	0
1	7	pushing	0
1	8	approved	0
2	1	%	0
2	7	support	0
1	3	gap	0
1	5	going	0
1	4	even	0
1	6	income	0
2	2	an	1
2	5	years	0
1	2	17	0
1	6	nation	0
1	7	include	0
1	7	gaining	0
1	5	apart	0
1	9	part-time	0
1	8	capacity	0
3	4	less	0
1	6	matter	0
3	6	people	0
1	4	used	0
1	6	28,000	0
1	10	individual	0
1	4	sign	0
1	7	500,000	0
1	7	fastest	0
1	2	46	0
3	2	by	1
1	7	resided	0
1	7	classes	0
1	10	commercial	0
1	5	First	0
1	10	Chancellor	0
2	4	take	0
1	5	worth	0
1	6	starts	0
2	4	York	0
1	5	round	0
1	6	influx	0
1	7	credits	0
2	3	was	1
1	3	You	0
1	3	too	1
1	6	vo

In [88]:
from nltk import pos_tag

In [95]:
tokens1 = word_tokenize(text1).lower())
posTokens = pos_tag(tokens1)

In [96]:
tags = list( set( [ x[1][0] for x in posTokens ] ) )

print(tags)

['C', 'D', 'R', '.', ':', 'W', ',', 'N', 'P', 'J', 'T', "'", 'V', 'M', '`', 'I', '$']


In [97]:
from collections import defaultdict

leftOfToken = defaultdict(Counter)
rightOfToken = defaultdict(Counter)

for i in range(len(posTokens)):
    tag = posTokens[i][1][0]
    token = (posTokens[i][0], tag)
    if i > 0:
        ltag = posTokens[i - 1][1][0]
        leftOfToken[token][ltag] += 1
    if i < len(posTokens) - 1:
        rtag = posTokens[i + 1][1][0]
        rightOfToken[token][rtag] += 1

In [98]:
for token in leftOfToken.keys():
    leftVector = []
    rightVector = []
    for tag in tags:
        leftVector.append(leftOfToken[token][tag])
        rightVector.append(rightOfToken[token][tag])
    print(" ".join([ str(x) for x in leftVector ]), " ".join([ str(x) for x in rightVector ]), token[0], token[1])

0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 growing V
1 7 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 3 0 0 1 0 0 0 2 0 0 0 0 2 0 year N
1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 cost N
0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 everyone N
0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 class N
0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 Why W
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 books N
0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 : :
0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 But C
0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 retraining V
0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 general J
0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 hike N
0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 many J
0 1 0 0 0 0 

In [99]:
text2 = """A flight out of Austin, Texas, was delayed after
a pilot behaved in a way that caused passengers to believe
she was mentally unstable, a United Airlines spokesman said Sunday.
The pilot, whom CNN is not naming, boarded the plane in street
clothes and began speaking to passengers over the intercom,
spokesman Charlie Hobart said.
Passengers on Saturday's San Francisco-bound flight took to
social media to express concerns after the pilot spoke to
them about her divorce and the presidential election, among
other issues. """

tokens2 = word_tokenize(text2).lower())

posTokens2 = pos_tag(tokens2)

leftOfToken2 = defaultdict(Counter)
rightOfToken2 = defaultdict(Counter)

for i in range(len(posTokens2)):
    token = (posTokens2[i][0], posTokens2[i][1][0])
    if i > 0:
        ltag = posTokens2[i - 1][1][0]
        leftOfToken2[token][ltag] += 1
    if i < len(posTokens2) - 1:
        rtag = posTokens2[i + 1][1][0]
        rightOfToken2[token][rtag] += 1

for token in leftOfToken2.keys():
    leftVector = []
    rightVector = []
    for tag in tags:
        leftVector.append(leftOfToken2[token][tag])
        rightVector.append(rightOfToken2[token][tag])
    print(" ".join([ str(x) for x in leftVector ]), " ".join([ str(x) for x in rightVector ]), token[0], token[1])


0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 caused V
0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 unstable J
0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 flight N
0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 Hobart N
0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 clothes N
0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 presidential J
0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 spoke V
0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 behaved V
0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 Texas N
0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 believe V
0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 passengers N
0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 Airlines N
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0

## Using Timbl in Python