# Data Wrangling Fundamentals
> Data wrangling Exercises Chapter 1



Within the data science ecosystem, it is widely held that the bulk of the data science process, by some estimates over ninety-five per cent, is data cleaning and wrangling. You would expect such an important aspect of data science to receive more attention, but in reality we are almost always drawn to the shiny parts of things. Consequently, beginners are led astray from the very start of their journey. Do not be deceived: data science is data cleaning and data wrangling! To help close the yawning knowledge gap in the craft of data wrangling and cleaning, I have decided to collect data-wrangling exercises from different resources. This notebook and subsequent ones will work through these exercises and highlight their real-life application in data science.

In [1]:
import numpy as np
import pandas as pd

In [2]:
a = np.random.randn(5,3)
a

array([[ 0.96693939,  1.95682231,  1.71044694],
       [ 0.98359589, -0.54727982,  0.09999907],
       [ 0.18117676, -0.43009298, -1.79813572],
       [ 0.59418704, -0.6342957 , -0.60254704],
       [ 0.87621625,  1.31368179,  0.06375445]])

### Steps of Data Wrangling
- Scraping raw data from multiple sources (including the web and database tables)
- Imputing (replacing missing data using various techniques), formatting, and transforming: basically making the data ready to be used in the modeling process
- Handling read/write errors
- Detecting outliers
- Performing quick visualizations (plotting) and basic statistical analyses to judge the quality of the formatted data
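
Two of the steps above, imputing and outlier detection, can be sketched with the standard library alone. This is a minimal illustration under my own assumptions: the `raw` readings are made up, missing values are filled with the median, and outliers are flagged with the common 1.5 × IQR rule. Other imputation and outlier rules are equally valid.

```python
from statistics import median, quantiles

# Hypothetical raw readings; None marks a missing value.
raw = [12.0, 11.5, None, 12.3, 95.0, 11.9, None, 12.1]

# Impute: replace missing entries with the median of the observed values.
observed = [v for v in raw if v is not None]
fill = median(observed)
imputed = [fill if v is None else v for v in raw]

# Detect outliers with the 1.5 * IQR rule (an assumed, commonly used threshold).
q1, _, q3 = quantiles(imputed, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in imputed if v < low or v > high]

print(imputed)
print(outliers)   # [95.0]
```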

### Accessing the List Members: Exercise 1.01


In [3]:
ssn = list(pd.read_csv('./The-Data-Wrangling-Workshop/chapter01/datasets/ssn.csv'))

In [4]:
# first element
ssn[0]

'218-68-9955'

In [5]:
# the 4th element
ssn[3]

'563-93-1393'

In [6]:
# last element, indexed via the length of the list
ssn[len(ssn)- 1]

'825-05-4836'

In [7]:
ssn[-1]

'825-05-4836'

In [8]:
# first three elements
ssn[:3]

['218-68-9955', '165-73-3124', '432-47-4043']

In [9]:
# last two elements
ssn[-2:]

['726-13-1007', '825-05-4836']

In [10]:
# everything except the last two elements, using a negative stop index
ssn[:-2]

['218-68-9955',
 '165-73-3124',
 '432-47-4043',
 '563-93-1393',
 '153-93-3401',
 '670-09-7369',
 '123-05-9652',
 '812-13-2476']

In [11]:
# reverse the elements of the list
ssn[-1::-1]

['825-05-4836',
 '726-13-1007',
 '812-13-2476',
 '123-05-9652',
 '670-09-7369',
 '153-93-3401',
 '563-93-1393',
 '432-47-4043',
 '165-73-3124',
 '218-68-9955']

In [12]:
ssn[:]

['218-68-9955',
 '165-73-3124',
 '432-47-4043',
 '563-93-1393',
 '153-93-3401',
 '670-09-7369',
 '123-05-9652',
 '812-13-2476',
 '726-13-1007',
 '825-05-4836']
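
The full slice `ssn[:]` looks like a no-op, but it returns a new list rather than a second name for the same one. A small sketch of the difference, using a shortened hypothetical list (`ids`) in place of `ssn`:

```python
# Hypothetical short list standing in for the ssn list above.
ids = ['218-68-9955', '165-73-3124', '432-47-4043']

copy_of_ids = ids[:]     # full slice: a new list holding the same elements
alias_of_ids = ids       # plain assignment: another name for the same list

copy_of_ids.append('000-00-0000')    # does not touch ids
alias_of_ids.append('111-11-1111')   # changes ids as well

print(len(ids))                            # 4
print(copy_of_ids is ids)                  # False
print(ids[::-1] == list(reversed(ids)))    # True
```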

### Generating and Iterating through a List: 1.02

In [13]:
# Iterate over a list and append each element to another list
ssn_2 = []
for i in ssn:
    ssn_2.append(i)
ssn_2


['218-68-9955',
 '165-73-3124',
 '432-47-4043',
 '563-93-1393',
 '153-93-3401',
 '670-09-7369',
 '123-05-9652',
 '812-13-2476',
 '726-13-1007',
 '825-05-4836']

In [14]:
# using list comprehension to generate list
ssn3 = ['soc: '+ x for x in ssn_2]
ssn3

['soc: 218-68-9955',
 'soc: 165-73-3124',
 'soc: 432-47-4043',
 'soc: 563-93-1393',
 'soc: 153-93-3401',
 'soc: 670-09-7369',
 'soc: 123-05-9652',
 'soc: 812-13-2476',
 'soc: 726-13-1007',
 'soc: 825-05-4836']

In [15]:
i = 0
while i < len(ssn3):
    print(ssn3[i])
    i += 1

soc: 218-68-9955
soc: 165-73-3124
soc: 432-47-4043
soc: 563-93-1393
soc: 153-93-3401
soc: 670-09-7369
soc: 123-05-9652
soc: 812-13-2476
soc: 726-13-1007
soc: 825-05-4836


In [16]:
# keep only the entries whose number contains a 5
numbers = [x for x in ssn3 if '5' in x]
numbers

['soc: 218-68-9955',
 'soc: 165-73-3124',
 'soc: 563-93-1393',
 'soc: 153-93-3401',
 'soc: 123-05-9652',
 'soc: 825-05-4836']

In [17]:
# using the + operator to concatenate two lists
ssn_4 = ['102-90-0314','247-17-2338','318-22-2760']
ssn_5 = ssn_4 + ssn
ssn_5

['102-90-0314',
 '247-17-2338',
 '318-22-2760',
 '218-68-9955',
 '165-73-3124',
 '432-47-4043',
 '563-93-1393',
 '153-93-3401',
 '670-09-7369',
 '123-05-9652',
 '812-13-2476',
 '726-13-1007',
 '825-05-4836']

In [18]:
# using the extend method
ssn_2.extend(ssn_4)
ssn_2

['218-68-9955',
 '165-73-3124',
 '432-47-4043',
 '563-93-1393',
 '153-93-3401',
 '670-09-7369',
 '123-05-9652',
 '812-13-2476',
 '726-13-1007',
 '825-05-4836',
 '102-90-0314',
 '247-17-2338',
 '318-22-2760']
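
The two cells above look similar but behave differently: `+` builds a new list, while `extend` modifies the list it is called on and returns `None`. A short sketch with hypothetical values (`base`, `extra`):

```python
base = ['102-90-0314', '247-17-2338']   # hypothetical values
extra = ['318-22-2760']

combined = base + extra        # new list; base is left unchanged
print(len(base))               # 2

result = base.extend(extra)    # mutates base in place
print(result)                  # None
print(base == combined)        # True
```

Assigning the result of `extend` to a variable is a common slip, since the variable ends up holding `None`.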

In [19]:
# #nested list
# for x in ssn_2:
#     for y in ssn_5:
#         print(str(x) + ',' + str(y))

### Iterating Over a List and Checking Membership: 1.03

In [20]:
car_model = list(pd.read_csv('./The-Data-Wrangling-Workshop/chapter01/datasets/car_models.csv'))
car_model

['Escalade ',
 ' X5 M',
 'D150',
 'Camaro',
 'F350',
 'Aurora',
 'S8',
 'E350',
 'Tiburon',
 'F-Series Super Duty ']

In [21]:
# iterating over a list in a non-Pythonic way (by index)
list_1 = [x for x in car_model]
for i in range(0,len(list_1)):
    print(list_1[i])

Escalade 
 X5 M
D150
Camaro
F350
Aurora
S8
E350
Tiburon
F-Series Super Duty 


In [22]:
# iterating in a pythonic manner
for i in list_1:
    print(i)

Escalade 
 X5 M
D150
Camaro
F350
Aurora
S8
E350
Tiburon
F-Series Super Duty 


In [23]:
'D150' in list_1, 'Mustang' in list_1

(True, False)
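
Membership tests compare values exactly, so the stray spaces visible in the `car_model` output matter. A minimal sketch, using four values copied from that output; the cleaned set `clean_models` is my own addition for illustration:

```python
# Values copied from the car_model output above, including stray spaces.
models = ['Escalade ', ' X5 M', 'D150', 'Camaro']

print('Escalade' in models)          # False: the stored value is 'Escalade '

# Strip whitespace once, then test membership against the cleaned values.
clean_models = {m.strip() for m in models}
print('Escalade' in clean_models)    # True
print('X5 M' in clean_models)        # True
```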

### Sorting A List: Exercise 1.04

In [24]:
list_1 = [*range(0, 21,1)]
list_1.sort(reverse=True)
list_1

[20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]

In [25]:
list_1.reverse()
list_1

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
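
`list.sort()` and `list.reverse()` both change the list in place. When the original order must be kept, the built-in `sorted()` returns a new list instead. A short sketch with hypothetical data:

```python
values = [3, 1, 2]   # hypothetical data

descending = sorted(values, reverse=True)   # new list, values untouched
print(values)        # [3, 1, 2]
print(descending)    # [3, 2, 1]

values.sort()        # in place, returns None
print(values)        # [1, 2, 3]

# key= works with both forms; here strings are ordered by length.
words = ['pear', 'fig', 'banana']
print(sorted(words, key=len))   # ['fig', 'pear', 'banana']
```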

In [27]:
# squares of the numbers in list_1
list_2 = [ x**2 for x in list_1]
list_2

[0,
 1,
 4,
 9,
 16,
 25,
 36,
 49,
 64,
 81,
 100,
 121,
 144,
 169,
 196,
 225,
 256,
 289,
 324,
 361,
 400]

In [28]:
from math import log
import random

In [29]:
list_1a = [random.randint(0,30) for x in range(0,100)]


In [30]:
# square each element (note: despite the name, sqrty holds squares, not square roots)
sqrty = [randy**2 for randy in list_1a]


In [31]:
# base-10 log of (x + 1) for each element of sqrty
log_lst = [log(x + 1, 10) for x in sqrty]


### Activity 1.01 Handling List

In [32]:
# list of 100 random numbers between 0 and 30
hundred_rand = [random.randint(0,30) for x in range(0,100)]


In [33]:
div_three = [x for x in hundred_rand if x % 3==0]


In [34]:
# difference in length between the original list and the divisible-by-3 list
diff_len = len(hundred_rand) - len(div_three)

In [35]:
new_lst = []
number_of_experiment = 10
for g in range(0, number_of_experiment):
    randyx = [random.randint(0,100) for x in range(0,100)]
    div_3x = [x for x in randyx if x % 3==0]
    diff_len = len(randyx) - len(div_3x)
    new_lst.append(diff_len)
new_lst
    

[61, 70, 62, 62, 73, 70, 60, 66, 59, 72]

In [36]:
from statistics import mean

In [37]:
# average (mean) of the per-experiment differences
the_mean = mean(new_lst)
the_mean


  


65.5
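
The experiment above can also be wrapped in a small function so that it is easy to rerun. This is only a sketch: the function name `count_not_divisible`, its parameters, and the seed `42` are my own choices, not part of the exercise. Since `randint(0, 100)` draws from 101 values, of which 34 are divisible by 3, roughly two-thirds of each sample should remain.

```python
import random
from statistics import mean

def count_not_divisible(n_samples=100, low=0, high=100, divisor=3, rng=random):
    """Draw n_samples random integers and count those not divisible by divisor."""
    sample = [rng.randint(low, high) for _ in range(n_samples)]
    return sum(1 for x in sample if x % divisor != 0)

# A seeded generator makes the run repeatable (the seed value is arbitrary).
rng = random.Random(42)
results = [count_not_divisible(rng=rng) for _ in range(10)]

print(results)
print(mean(results))
```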

### Introduction to Sets


In [38]:
list_12 = list(set(hundred_rand))
list_12


[0,
 1,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 14,
 15,
 16,
 17,
 18,
 19,
 20,
 21,
 22,
 23,
 24,
 25,
 26,
 27,
 28,
 30]

### Union and Intersection of Set


In [39]:
set_1 = {'Apple', 'Orange', 'Banana'}
set_2 = {'Pear', 'Peach', 'Mango', 'Banana'}
# the union of the two sets
set_1 | set_2

{'Apple', 'Banana', 'Mango', 'Orange', 'Peach', 'Pear'}

In [40]:
# intersection of two sets
set_1 & set_2

{'Banana'}

### Creating Null set


In [41]:
# create an empty (null) set; note that {} on its own creates an empty dict
non_set = set()
non_set

set()

### Dictionary 

In [42]:
dict_1 = {'key1':'value1', 'key2':'value2'}


### Accessing and Setting Values in a dictionary
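
A minimal sketch of reading and writing dictionary values. The `person` record is a hypothetical example of mine:

```python
person = {'name': 'Tom', 'score': 100}   # hypothetical record

# Access: square brackets raise KeyError for a missing key, get() does not.
print(person['name'])         # Tom
print(person.get('age'))      # None
print(person.get('age', 0))   # 0

# Set: assigning to a key updates it if present, adds it otherwise.
person['score'] = 150
person['age'] = 30
print(person)   # {'name': 'Tom', 'score': 150, 'age': 30}
```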

### Revisiting the unique Valued List Problem
- `dict()` `fromkeys()` and `keys()`

In [43]:
# generate a random list with duplicate values
list_rand = [random.randint(0,30) for x in range(0,100)]



In [44]:
# create a list of unique values from list_rand (first-seen order is preserved)
list(dict.fromkeys(list_rand).keys())

[5,
 22,
 29,
 19,
 20,
 3,
 28,
 24,
 13,
 7,
 8,
 6,
 16,
 12,
 30,
 18,
 15,
 1,
 0,
 9,
 23,
 2,
 21,
 17,
 14,
 11,
 27,
 25,
 4,
 26,
 10]
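
Compared with `list(set(...))` used earlier, `dict.fromkeys` keeps the values in the order they first appear (dictionaries preserve insertion order from Python 3.7 on). The `.keys()` call is optional, because iterating a dictionary already yields its keys. A short sketch with a hypothetical list:

```python
values = [5, 22, 5, 29, 22, 19]   # hypothetical list with duplicates

ordered_unique = list(dict.fromkeys(values))
print(ordered_unique)             # [5, 22, 29, 19]

# A set also removes duplicates, but makes no promise about order,
# so sort it if a predictable order is needed.
print(sorted(set(values)))        # [5, 19, 22, 29]
```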

### Deleting a Value From Dict ex.1.09
This exercise deletes entries from a dict using the `del` statement.

In [45]:
dict_1 = {"key1": 1, "key2": ["list_element1", 34], "key3": "value3",
          "key4": {"subkey1": "v1"}, "key5": 4.5}
dict_1

{'key1': 1,
 'key2': ['list_element1', 34],
 'key3': 'value3',
 'key4': {'subkey1': 'v1'},
 'key5': 4.5}

In [46]:
# use the del statement and specify the key to be deleted
del dict_1['key2']
dict_1

{'key1': 1, 'key3': 'value3', 'key4': {'subkey1': 'v1'}, 'key5': 4.5}

In [47]:
#delete key3 and key4
del dict_1['key3']
del dict_1['key4']
dict_1

{'key1': 1, 'key5': 4.5}
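
`del` raises a `KeyError` when the key is absent. When the removed value is needed, or the key may be missing, `dict.pop` is an alternative. A small sketch; the `record` dictionary mirrors the final state of `dict_1` above and `'key9'` is a deliberately missing key:

```python
record = {'key1': 1, 'key5': 4.5}

removed = record.pop('key5')         # removes the key and returns its value
print(removed)                       # 4.5

missing = record.pop('key9', None)   # a default avoids the KeyError
print(missing)                       # None

try:
    del record['key9']               # del has no default
except KeyError as err:
    print('missing key:', err)       # missing key: 'key9'
```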

### Dictionary Comprehension ex 1.10
Dictionary comprehensions are used less often than list comprehensions, but they come in handy when building key-value data such as customer names and their ages, or credit cards and their owners.

In [48]:
list_dict = [x for x in range(0,10)]
dict_comp = {x: x**2 for x in list_dict}
dict_comp

{0: 0, 1: 1, 2: 4, 3: 9, 4: 16, 5: 25, 6: 36, 7: 49, 8: 64, 9: 81}
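
As a sketch of the customer example mentioned above, a dictionary comprehension can pair two lists with `zip` and also filter entries. The names and ages here are made up:

```python
names = ['Tom', 'Dick', 'Harry']   # hypothetical customers
ages = [34, 27, 45]                # hypothetical ages, same order as names

age_by_name = {name: age for name, age in zip(names, ages)}
print(age_by_name)   # {'Tom': 34, 'Dick': 27, 'Harry': 45}

# A condition at the end keeps only the matching pairs.
over_30 = {name: age for name, age in age_by_name.items() if age > 30}
print(over_30)       # {'Tom': 34, 'Harry': 45}
```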

In [49]:
# generate a dictionary from a list of (key, value) tuples
## using the dict function
dict_2 = dict([('Tom',100),('Dick',200),('Harry',300)])
dict_2

{'Tom': 100, 'Dick': 200, 'Harry': 300}

In [50]:
# using the dict function to create dictionary
dict_3 = dict(Tom=100, Dick=200,Harry=300)
dict_3

{'Tom': 100, 'Dick': 200, 'Harry': 300}

## Tuples
- A distinguishing feature of a tuple is its immutability: once created, it cannot be updated by adding or removing elements

- A tuple consists of values separated by commas

In [51]:
tuple_1 = 24,42,2.3456, 'Hello'

- The length of a tuple is called its **cardinality**

### Creating a Tuple with Different Cardinality

In [52]:
# creating an empty tuple
tuple_1 = ()

In [53]:
# tuple with only one value; the trailing comma is required
tuple_1 = 'Hello',

In [54]:
# tuples can be nested, just like lists
tuple_1 = 'hello', 'there'
tuple_12 = tuple_1, 45, 'Sam'
tuple_12

(('hello', 'there'), 45, 'Sam')

In [55]:
# # the immutability of tuple
# tuple_1 = 'Hello', 'World!'
# tuple_1[1] = 'Universe'
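
The cell above is commented out because item assignment on a tuple raises a `TypeError`. A minimal sketch that catches the error and then builds a new tuple instead (`tuple_2` is a name I introduce for illustration):

```python
tuple_1 = 'Hello', 'World!'

try:
    tuple_1[1] = 'Universe'        # tuples do not support item assignment
except TypeError as err:
    print(type(err).__name__)      # TypeError

# To "change" a tuple, build a new one from slices of the old.
tuple_2 = tuple_1[:1] + ('Universe',)
print(tuple_2)   # ('Hello', 'Universe')
print(tuple_1)   # ('Hello', 'World!')
```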

In [56]:
# access elements in a tuple
tuple_1 = ('good','morning!', 'how','are','you?')
tuple_1[0]

'good'

In [57]:
tuple_1[4]

'you?'

### Unpacking a Tuple

In [58]:
tuple_1 = 'Hello', 'World'
hello, world = tuple_1
print(hello)
print(world)

Hello
World


### Handling Tuple Ex 1.11

In [59]:
tupleE = '1', '3', '5'
tupleE

('1', '3', '5')

In [60]:
# print the elements at index 0 and index 1
print(tupleE[0])
print(tupleE[1])

1
3


## Strings
- An important feature of a string is that it is immutable
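
A short sketch of what immutability means in practice: assigning to a character fails, so changes are made by building a new string. The variable names are my own:

```python
greeting = 'Hello World!'

try:
    greeting[0] = 'J'              # strings do not support item assignment
except TypeError:
    print('strings cannot be modified in place')

# Build a new string instead; the original is unchanged.
changed = 'J' + greeting[1:]
print(changed)    # Jello World!
print(greeting)   # Hello World!
```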


### Accessing String Ex1.12

In [61]:
#create a string
string_1 = "Hello World!"
string_1

'Hello World!'

In [62]:
# access the first member of the string
string_1[0]

'H'

In [63]:
#access the fifth member of the string
string_1[4]

'o'

In [64]:
# access the last member of the string
string_1[-1]

'!'

### String Slices Ex 1.13

In [65]:
#create string
string_a = "Hello World! I am Learning data wrangling"
string_a

'Hello World! I am Learning data wrangling'

In [66]:
# specify the start and stop values and slice the string
string_a[2:10]

'llo Worl'

In [67]:
# omit the stop value to slice through to the end of the string
string_a[-31:]

'd! I am Learning data wrangling'

In [68]:
# using negative numbers for both slice values
string_a[-10:-5]

' wran'

### String Functions

In [69]:
# find the length of a string with len()
len(string_a)

41

In [70]:
# convert string case
## use lower() and upper() methods
str_1 = "A COMPLETE UPPER CASE STRING"


In [71]:
str_1.lower()

'a complete upper case string'

In [72]:
str_1.upper()

'A COMPLETE UPPER CASE STRING'

In [73]:
# search for a substring within a string
## use the find method; it returns the start index, or -1 if not found
str_1 = "A complicated string look like this"

In [74]:
str_1.find('complicated')

2

In [75]:
str_1.find('hello')

-1

In [76]:
# replace one substring with another
## use the replace method; it returns a new string
str_1

'A complicated string look like this'

In [77]:
str_1.replace('complicated', 'simple')

'A simple string look like this'

### Splitting and Joining String Ex 1.14
- The split and join methods
- Use `string.split(separator)` to break a string into a list
- Use `separator.join(list_of_strings)` to combine a list into a string

In [78]:
#create a string and convert it into a list
## using split
str_1 = "Name, age, Sex, Address"
list_1 = str_1.split(',')
list_1

['Name', ' age', ' Sex', ' Address']

In [79]:
# combine the list into another string, using | as the separator
s = '|'
s.join(list_1)

'Name| age| Sex| Address'
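
Notice that splitting on `','` leaves a leading space on every field after the first, and that space survives the join. One way to avoid this, sketched here with a hypothetical `fields` list, is to strip each piece after splitting:

```python
str_1 = "Name, age, Sex, Address"

# Strip surrounding whitespace from each piece after splitting.
fields = [part.strip() for part in str_1.split(',')]
print(fields)             # ['Name', 'age', 'Sex', 'Address']

print('|'.join(fields))   # Name|age|Sex|Address
```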

### Activity 1.02 Analyzing a Multi-line String and Generating the Unique Word Count

In [80]:
multiline_text= """It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.

However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters.

"My dear Mr. Bennet," said his lady to him one day, "have you heard that Netherfield Park is let at last?"

Mr. Bennet replied that he had not.

"But it is," returned she; "for Mrs. Long has just been here, and she told me all about it."

Mr. Bennet made no answer.

"Do you not want to know who has taken it?" cried his wife impatiently.

"You want to tell me, and I have no objection to hearing it."

This was invitation enough.

"Why, my dear, you must know, Mrs. Long says that Netherfield is taken by a young man of large fortune from the north of England; that he came down on Monday in a chaise and four to see the place, and was so much delighted with it, that he agreed with Mr. Morris immediately; that he is to take possession before Michaelmas, and some of his servants are to be in the house by the end of next week."

"What is his name?"""

In [81]:
multiline_text

'It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.\n\nHowever little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters.\n\n"My dear Mr. Bennet," said his lady to him one day, "have you heard that Netherfield Park is let at last?"\n\nMr. Bennet replied that he had not.\n\n"But it is," returned she; "for Mrs. Long has just been here, and she told me all about it."\n\nMr. Bennet made no answer.\n\n"Do you not want to know who has taken it?" cried his wife impatiently.\n\n"You want to tell me, and I have no objection to hearing it."\n\nThis was invitation enough.\n\n"Why, my dear, you must know, Mrs. Long says that Netherfield is taken by a young man of large fortune from the north of England; that he came down on Monday in a chaise a

In [82]:
# find its type and length
type(multiline_text), len(multiline_text)

(str, 1228)

In [83]:
# strip newlines and punctuation with chained replace() calls (double quotes become spaces)
multiline = multiline_text.replace('\n','').replace('?','').replace('.','').replace(';','').replace(',','').replace('"',' ')
multiline

'It is a truth universally acknowledged that a single man in possession of a good fortune must be in want of a wifeHowever little known the feelings or views of such a man may be on his first entering a neighbourhood this truth is so well fixed in the minds of the surrounding families that he is considered the rightful property of some one or other of their daughters My dear Mr Bennet  said his lady to him one day  have you heard that Netherfield Park is let at last Mr Bennet replied that he had not But it is  returned she  for Mrs Long has just been here and she told me all about it Mr Bennet made no answer Do you not want to know who has taken it  cried his wife impatiently You want to tell me and I have no objection to hearing it This was invitation enough Why my dear you must know Mrs Long says that Netherfield is taken by a young man of large fortune from the north of England that he came down on Monday in a chaise and four to see the place and was so much delighted with it that h

In [84]:
list_word = multiline.split(' ')
list_word

['It',
 'is',
 'a',
 'truth',
 'universally',
 'acknowledged',
 'that',
 'a',
 'single',
 'man',
 'in',
 'possession',
 'of',
 'a',
 'good',
 'fortune',
 'must',
 'be',
 'in',
 'want',
 'of',
 'a',
 'wifeHowever',
 'little',
 'known',
 'the',
 'feelings',
 'or',
 'views',
 'of',
 'such',
 'a',
 'man',
 'may',
 'be',
 'on',
 'his',
 'first',
 'entering',
 'a',
 'neighbourhood',
 'this',
 'truth',
 'is',
 'so',
 'well',
 'fixed',
 'in',
 'the',
 'minds',
 'of',
 'the',
 'surrounding',
 'families',
 'that',
 'he',
 'is',
 'considered',
 'the',
 'rightful',
 'property',
 'of',
 'some',
 'one',
 'or',
 'other',
 'of',
 'their',
 'daughters',
 'My',
 'dear',
 'Mr',
 'Bennet',
 '',
 'said',
 'his',
 'lady',
 'to',
 'him',
 'one',
 'day',
 '',
 'have',
 'you',
 'heard',
 'that',
 'Netherfield',
 'Park',
 'is',
 'let',
 'at',
 'last',
 'Mr',
 'Bennet',
 'replied',
 'that',
 'he',
 'had',
 'not',
 'But',
 'it',
 'is',
 '',
 'returned',
 'she',
 '',
 'for',
 'Mrs',
 'Long',
 'has',
 'just',
 'bee

In [85]:
unique_lst = list(set(list_word))
unique_lst

['',
 'been',
 'some',
 'fortune',
 'invitation',
 'agreed',
 'at',
 'not',
 'This',
 'one',
 'here',
 'take',
 'must',
 'he',
 'my',
 'I',
 'she',
 'first',
 'day',
 'families',
 'Park',
 'in',
 'no',
 'replied',
 'hearing',
 'immediately',
 'before',
 'wife',
 'from',
 'be',
 'Mrs',
 'have',
 'want',
 'that',
 'delighted',
 'man',
 'Monday',
 'it',
 'Netherfield',
 'universally',
 'and',
 'may',
 'good',
 'me',
 'him',
 'such',
 'week',
 'possession',
 'You',
 'little',
 'neighbourhood',
 'is',
 'enough',
 'by',
 'so',
 'daughters',
 'returned',
 'made',
 'young',
 'has',
 'property',
 'surrounding',
 'of',
 'well',
 'England',
 'four',
 'entering',
 'was',
 'known',
 'see',
 'north',
 'the',
 'said',
 'had',
 'single',
 'house',
 'Bennet',
 'Long',
 'acknowledged',
 'views',
 'feelings',
 'Do',
 'considered',
 'for',
 'says',
 'on',
 'tell',
 'large',
 'his',
 'last',
 'dear',
 'chaise',
 'place',
 'rightful',
 'impatiently',
 'down',
 'know',
 'other',
 'Why',
 'next',
 'to',
 'wit

In [86]:
unique_dict = dict.fromkeys(list_word)
unique_dict

{'It': None,
 'is': None,
 'a': None,
 'truth': None,
 'universally': None,
 'acknowledged': None,
 'that': None,
 'single': None,
 'man': None,
 'in': None,
 'possession': None,
 'of': None,
 'good': None,
 'fortune': None,
 'must': None,
 'be': None,
 'want': None,
 'wifeHowever': None,
 'little': None,
 'known': None,
 'the': None,
 'feelings': None,
 'or': None,
 'views': None,
 'such': None,
 'may': None,
 'on': None,
 'his': None,
 'first': None,
 'entering': None,
 'neighbourhood': None,
 'this': None,
 'so': None,
 'well': None,
 'fixed': None,
 'minds': None,
 'surrounding': None,
 'families': None,
 'he': None,
 'considered': None,
 'rightful': None,
 'property': None,
 'some': None,
 'one': None,
 'other': None,
 'their': None,
 'daughters': None,
 'My': None,
 'dear': None,
 'Mr': None,
 'Bennet': None,
 '': None,
 'said': None,
 'lady': None,
 'to': None,
 'him': None,
 'day': None,
 'have': None,
 'you': None,
 'heard': None,
 'Netherfield': None,
 'Park': None,
 'let': N

In [87]:
for x in list_word:
    if unique_dict[x] is None:
        unique_dict[x] = 1
    else:
        unique_dict[x] += 1
unique_dict        

{'It': 1,
 'is': 8,
 'a': 8,
 'truth': 2,
 'universally': 1,
 'acknowledged': 1,
 'that': 8,
 'single': 1,
 'man': 3,
 'in': 5,
 'possession': 2,
 'of': 10,
 'good': 1,
 'fortune': 2,
 'must': 2,
 'be': 3,
 'want': 3,
 'wifeHowever': 1,
 'little': 1,
 'known': 1,
 'the': 8,
 'feelings': 1,
 'or': 2,
 'views': 1,
 'such': 1,
 'may': 1,
 'on': 2,
 'his': 5,
 'first': 1,
 'entering': 1,
 'neighbourhood': 1,
 'this': 1,
 'so': 2,
 'well': 1,
 'fixed': 1,
 'minds': 1,
 'surrounding': 1,
 'families': 1,
 'he': 5,
 'considered': 1,
 'rightful': 1,
 'property': 1,
 'some': 2,
 'one': 2,
 'other': 1,
 'their': 1,
 'daughters': 1,
 'My': 1,
 'dear': 2,
 'Mr': 4,
 'Bennet': 3,
 '': 6,
 'said': 1,
 'lady': 1,
 'to': 7,
 'him': 1,
 'day': 1,
 'have': 2,
 'you': 3,
 'heard': 1,
 'Netherfield': 2,
 'Park': 1,
 'let': 1,
 'at': 1,
 'last': 1,
 'replied': 1,
 'had': 1,
 'not': 2,
 'But': 1,
 'it': 5,
 'returned': 1,
 'she': 2,
 'for': 1,
 'Mrs': 2,
 'Long': 2,
 'has': 2,
 'just': 1,
 'been': 1,
 'here'

In [88]:
top_words = sorted(unique_dict.items(), key=lambda x: x[1], reverse=True)
top_words[:25]


[('of', 10),
 ('is', 8),
 ('a', 8),
 ('that', 8),
 ('the', 8),
 ('to', 7),
 ('', 6),
 ('in', 5),
 ('his', 5),
 ('he', 5),
 ('it', 5),
 ('and', 5),
 ('Mr', 4),
 ('man', 3),
 ('be', 3),
 ('want', 3),
 ('Bennet', 3),
 ('you', 3),
 ('truth', 2),
 ('possession', 2),
 ('fortune', 2),
 ('must', 2),
 ('or', 2),
 ('on', 2),
 ('so', 2)]
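
Two artifacts show up in the counts above: the empty string `''` (from consecutive spaces) and `'wifeHowever'` (from deleting newlines instead of replacing them with a space). An alternative sketch that avoids both uses a regular expression to pull out the words and `collections.Counter` to count them. This is not the method the exercise asks for, and the `sample` text is only a short excerpt I assembled from the passage:

```python
import re
from collections import Counter

# Short excerpt of the passage, including the paragraph break.
sample = ("a good fortune, must be in want of a wife.\n\n"
          "However little known the feelings of such a man")

# Treat every run of letters (and apostrophes) as a word.
words = re.findall(r"[A-Za-z']+", sample)
counts = Counter(words)

print(counts.most_common(2))      # [('a', 3), ('of', 2)]
print('wifeHowever' in counts)    # False
print('' in counts)               # False
```

Lower-casing the words before counting (for example with `sample.lower()`) would also merge `'It'` with `'it'`, if case should not matter.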