# Data Wrangling Fundamentals
> Data wrangling Exercises Chapter 1
- toc:true 
- branch:master
- badges:true
- comments:false
- categories: [jupyter]



Within the data science ecosystem, it is commonly said that over ninety-five per cent of the data science process is data cleaning and wrangling. You would expect such an important part of the craft to get more attention, but we are almost always drawn to the shiny parts of things. As a result, beginners are led astray from the very start of their journey. Do not be deceived; data science is data cleaning and data wrangling! To help close this yawning knowledge gap, I have decided to collect data-wrangling exercises from different resources. This notebook and the ones that follow will work through these exercises and highlight their real-life applications in data science.

In [1]:
import numpy as np
import pandas as pd

In [2]:
a = np.random.randn(5,3)
a

array([[-1.12506181, -0.32504384,  1.67238151],
       [-0.04049187, -1.85380366, -1.47251248],
       [-1.39307679, -2.52397974,  0.96097851],
       [ 0.74104312, -0.14694375, -0.25140089],
       [-1.2898671 , -0.72692951, -1.49530227]])

## Steps of Data Wrangling


- Scraping raw data from multiple sources (including the web and database tables)
- Imputing (replacing missing data using various techniques), formatting, and transforming; basically making the data ready for the modeling process
- Handling read/write errors
- Detecting outliers
- Performing quick visualizations (plotting) and basic statistical analyses to judge the quality of the formatted data
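
Two of the steps above, imputing and outlier detection, can be sketched with nothing but the standard library. This is only an illustration under stated assumptions: the `raw` readings are made up, the median is just one of many possible imputation choices, and the 1.5 * IQR rule is one common convention for flagging outliers.

```python
from statistics import median, quantiles

# Hypothetical raw readings; None marks a missing value.
raw = [12.0, 15.5, None, 14.0, 13.5, 120.0, None, 16.0]

# Impute: replace each missing value with the median of the observed values.
observed = [x for x in raw if x is not None]
fill = median(observed)
imputed = [fill if x is None else x for x in raw]

# Detect outliers with the 1.5 * IQR rule.
q1, _, q3 = quantiles(imputed, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in imputed if x < low or x > high]

print(imputed)
print(outliers)
```

In practice pandas offers the same operations on whole columns, but the logic is the same: fill the gaps first, then look for values far outside the bulk of the data.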

## Accessing the List Members: Exercise 1.01


In [4]:
ssn = list(pd.read_csv('./The-Data-Wrangling-Workshop/chapter01/datasets/ssn.csv'))

In [None]:
# first element
ssn[0]

In [None]:
# the 4th element
ssn[3]

In [None]:
# last element, indexed using the length of the list
ssn[len(ssn)- 1]

In [None]:
ssn[-1]

In [None]:
# first three elements
ssn[:3]

In [None]:
# last two elements
ssn[-2:]

In [None]:
# everything except the last two elements, using a negative index
ssn[:-2]

In [None]:
# reverse the elements in the list
ssn[-1::-1]

In [None]:
ssn[:]
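
Because the outputs of these cells depend on the CSV file, here is a minimal sketch of the same indexing and slicing patterns on a small made-up list. The `sample` values are hypothetical stand-ins, not the contents of `ssn.csv`.

```python
# Hypothetical values standing in for the contents of ssn.csv.
sample = ['a', 'b', 'c', 'd', 'e']

first = sample[0]               # first element
last = sample[-1]               # last element, same as sample[len(sample) - 1]
first_three = sample[:3]        # elements at indices 0, 1 and 2
last_two = sample[-2:]          # the final two elements
all_but_last_two = sample[:-2]  # everything except the final two elements
reversed_copy = sample[::-1]    # a reversed copy; sample itself is unchanged
shallow_copy = sample[:]        # a new list holding the same elements

print(first_three, last_two, all_but_last_two, reversed_copy)
```

Every slice returns a new list, so none of these operations modifies `sample`.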

### Generating and Iterating through a List: 1.02

In [None]:
# Use a for loop with the append method to copy each element into another list
ssn_2 = []
for i in ssn:
    ssn_2.append(i)
ssn_2


In [None]:
# use a list comprehension to generate a new list
ssn3 = ['soc: '+ x for x in ssn_2]
ssn3

In [None]:
i = 0
while i < len(ssn3):
    print(ssn3[i])
    i += 1

In [None]:
# search for ssn with 5 in the number
numbers = [x for x in ssn3 if '5' in x]
numbers

In [None]:
# using the + operator to concatenate two lists into a new one
ssn_4 = ['102-90-0314','247-17-2338','318-22-2760']
ssn_5 = ssn_4 + ssn
ssn_5

In [None]:
# using the extend method (modifies ssn_2 in place)
ssn_2.extend(ssn_4)
ssn_2
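
The practical difference between `+` and `extend` is easy to miss: `+` builds a new list, while `extend` changes the existing list and returns `None`. A small sketch with made-up values (the `base` and `extra` names are only for illustration):

```python
base = ['102-90-0314', '247-17-2338']   # hypothetical values
extra = ['318-22-2760']

combined = base + extra          # a brand-new list; base is untouched
length_after_plus = len(base)    # still 2

result = base.extend(extra)      # mutates base in place and returns None
length_after_extend = len(base)  # now 3

print(combined, result, base)
```

A common slip is writing `base = base.extend(extra)`, which replaces the list with `None`.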

In [None]:
# # nested loop over two lists
# for x in ssn_2:
#     for y in ssn_5:
#         print(str(x) + ',' + str(y))

### Iterating Over a List and Checking Membership: 1.03

In [None]:
car_model = list(pd.read_csv('./The-Data-Wrangling-Workshop/chapter01/datasets/car_models.csv'))
car_model

In [None]:
# iterate over a list by index (the non-Pythonic way)
list_1 = [x for x in car_model]
for i in range(0,len(list_1)):
    print(list_1[i])

In [None]:
# iterating in a pythonic manner
for i in list_1:
    print(i)

In [None]:
'D150' in list_1, 'Mustang' in list_1

### Sorting A List: Exercise 1.04

In [None]:
list_1 = [*range(0, 21,1)]
list_1.sort(reverse=True)
list_1

In [None]:
list_1.reverse()
list_1

In [None]:
# squares of the numbers in list_1
list_2 = [ x**2 for x in list_1]
list_2

In [None]:
from math import log
import random

In [None]:
list_1a = [random.randint(0,30) for x in range(0,100)]


In [None]:
# square each number (note: this squares the values; it does not take the square root)
sqrty = [randy**2 for randy in list_1a]


In [None]:
# base-10 log of (x + 1) for each element of sqrty; adding 1 avoids log(0)
log_lst = [log(x + 1, 10) for x in sqrty]


### Activity 1.01 Handling List

In [None]:
# list of 100 random numbers between 0 and 30 (inclusive)
hundred_rand = [random.randint(0,30) for x in range(0,100)]


In [None]:
div_three = [x for x in hundred_rand if x % 3==0]


In [None]:
# difference in length of list
diff_len = len(hundred_rand) - len(div_three)

In [None]:
new_lst = []
number_of_experiment = 10
for g in range(0, number_of_experiment):
    randyx = [random.randint(0,100) for x in range(0,100)]
    div_3x = [x for x in randyx if x % 3==0]
    diff_len = len(randyx) - len(div_3x)
    new_lst.append(diff_len)
new_lst
    

In [None]:
# scipy's top-level mean alias is deprecated; the standard library provides an equivalent
from statistics import mean

In [None]:
#average or mean
the_mean = mean(new_lst)
the_mean
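
The experiment above can also be wrapped in a function so that it is repeatable. This is a sketch rather than the workshop's own solution: the function name `not_divisible_counts` and its parameters are hypothetical, and the optional `seed` is my addition to make runs reproducible.

```python
import random
from statistics import mean

def not_divisible_counts(experiments=10, size=100, upper=100, seed=None):
    """For each experiment, count how many of `size` random integers
    in [0, upper] are NOT divisible by 3."""
    rng = random.Random(seed)  # a private generator, so the seed stays local
    counts = []
    for _ in range(experiments):
        numbers = [rng.randint(0, upper) for _ in range(size)]
        divisible = [n for n in numbers if n % 3 == 0]
        counts.append(len(numbers) - len(divisible))
    return counts

counts = not_divisible_counts(seed=42)
average = mean(counts)
print(counts, average)
```

Roughly a third of the integers are divisible by 3, so the average should land somewhere near two thirds of `size`.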


## Introduction to Sets


In [None]:
list_12 = list(set(hundred_rand))
list_12


### Union and Intersection of Set


In [None]:
set_1 = {'Apple', 'Orange', 'Banana'}
set_2 = {'Pear', 'Peach', 'Mango', 'Banana'}
# the union of the two sets
set_1 | set_2

In [None]:
# intersection of two sets
set_1 & set_2

### Creating Null set


In [None]:
# create an empty (null) set; note that {} on its own creates an empty dict
non_set = set()
non_set
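
A quick sketch of why `set()` is needed for an empty set: empty braces are reserved for dictionaries. The variable names here are only for illustration.

```python
empty_set = set()
empty_dict = {}

# The type names make the difference explicit.
kinds = (type(empty_set).__name__, type(empty_dict).__name__)

empty_set.add('Apple')  # sets grow with add()
empty_set.add('Apple')  # adding a duplicate has no effect

print(kinds, empty_set)
```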

## Dictionary 

In [None]:
dict_1 = {'key1':'value1', 'key2':'value2'}


### Accessing and Setting Values in a dictionary
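
A minimal sketch of reading and writing dictionary values, reusing the shape of `dict_1` from above. The key `'key9'` and the default `'not found'` are made up for illustration.

```python
dict_1 = {'key1': 'value1', 'key2': 'value2'}

value = dict_1['key1']                    # square brackets raise KeyError if the key is absent
missing = dict_1.get('key9', 'not found') # get() returns a default instead of raising

dict_1['key1'] = 'new value'              # assigning to an existing key overwrites its value
dict_1['key3'] = 'value3'                 # assigning to a new key adds an entry

print(value, missing, dict_1)
```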

### Revisiting the unique Valued List Problem
- Uses `dict()`, `fromkeys()`, and `keys()`

In [None]:
# generate a random list with duplicate values
list_rand = [random.randint(0,30) for x in range(0,100)]



In [None]:
# create a unique-valued list from list_rand
list(dict.fromkeys(list_rand).keys())
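
One reason to prefer `dict.fromkeys` over `set` for this problem is ordering: dictionaries keep keys in insertion order (guaranteed since Python 3.7), whereas a set makes no ordering promise. A small sketch with a made-up list:

```python
values = [3, 1, 3, 2, 1]

# Duplicates are dropped, and the first occurrence of each value keeps its position.
unique_in_order = list(dict.fromkeys(values))

# A set also removes duplicates, but the order must be imposed afterwards.
unique_sorted = sorted(set(values))

print(unique_in_order, unique_sorted)
```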

### Deleting a Value From Dict Ex.1.09
This exercise deletes entries from a dict using the `del` statement.

In [None]:
dict_1 = {"key1": 1, "key2": ["list_element1", 34], "key3": "value3",
          "key4": {"subkey1": "v1"}, "key5": 4.5}
dict_1

In [None]:
# use the del statement and specify the key of the element to be deleted
del dict_1['key2']
dict_1

In [None]:
#delete key3 and key4
del dict_1['key3']
del dict_1['key4']
dict_1

### Dictionary Comprehension ex 1.10
Dictionary comprehensions are used less often than list comprehensions, but they come in handy when building key-value data, such as customer names and their ages, or credit cards and their owners.

In [None]:
list_dict = [x for x in range(0,10)]
dict_comp = {x: x**2 for x in list_dict}
dict_comp
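
As a sketch of the customer use case, a dictionary comprehension can pair two parallel lists and filter at the same time. The names and ages below are invented for illustration.

```python
names = ['Tom', 'Dick', 'Harry']  # hypothetical customers
ages = [34, 27, 45]               # hypothetical ages, in the same order

# Pair each name with its age.
age_by_name = {name: age for name, age in zip(names, ages)}

# A comprehension can also filter: keep only customers older than 30.
over_30 = {name: age for name, age in age_by_name.items() if age > 30}

print(age_by_name, over_30)
```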

In [None]:
# generate a dictionary from a list of (key, value) tuples
## using the dict function
dict_2 = dict([('Tom',100),('Dick',200),('Harry',300)])
dict_2

In [None]:
# using the dict function with keyword arguments to create a dictionary
dict_3 = dict(Tom=100, Dick=200,Harry=300)
dict_3

## Tuples


- A defining feature of a tuple is immutability: once created, it cannot be updated by adding or removing elements.

- A tuple consists of values separated by commas.

In [None]:
tuple_1 = 24,42,2.3456, 'Hello'

- The length of a tuple is called its **cardinality**.

### Creating a Tuple with Different Cardinality

In [None]:
# creating an empty tuple
tuple_1 = ()

In [None]:
# a tuple with only one value; the trailing comma is required
tuple_1 = 'Hello',

In [None]:
# tuples can be nested, just like lists
tuple_1 = 'hello', 'there'
tuple_12 = tuple_1, 45, 'Sam'
tuple_12

In [None]:
# # the immutability of tuple
# tuple_1 = 'Hello', 'World!'
# tuple_1[1] = 'Universe'
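
The commented-out cell raises an error when run. A sketch that catches the error shows the behaviour without stopping the notebook, and also shows the usual workaround of building a new tuple (`tuple_2` is a name introduced here for illustration):

```python
tuple_1 = 'Hello', 'World!'

try:
    tuple_1[1] = 'Universe'  # item assignment is not supported on tuples
    error = None
except TypeError as exc:
    error = type(exc).__name__

# To "change" a tuple, build a new one from slices of the old one.
tuple_2 = tuple_1[:1] + ('Universe',)

print(error, tuple_1, tuple_2)
```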

In [None]:
# access elements in a tuple
tuple_1 = ('good','morning!', 'how','are','you?')
tuple_1[0]

In [None]:
tuple_1[4]

### Unpacking a Tuple

In [None]:
tuple_1 = 'Hello', 'World'
hello, world = tuple_1
print(hello)
print(world)

### Handling Tuple Ex 1.11

In [None]:
tupleE = '1', '3', '5'
tupleE

In [None]:
# print the elements at indices 0 and 1
print(tupleE[0])
print(tupleE[1])

## Strings


- An important feature of a string is that it is immutable: its characters cannot be changed in place.
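
Immutability can be demonstrated in a few lines. In this sketch, assigning to a character fails, and the fix is to build a new string (`string_2` and `failed` are names introduced here for illustration):

```python
string_1 = "Hello World!"

try:
    string_1[0] = 'J'  # strings do not support item assignment
    failed = False
except TypeError:
    failed = True

# Build a new string instead of modifying the old one.
string_2 = 'J' + string_1[1:]

print(failed, string_1, string_2)
```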


### Accessing String Ex1.12

In [None]:
#create a string
string_1 = "Hello World!"
string_1

In [None]:
# access the first member of the string
string_1[0]

In [None]:
#access the fifth member of the string
string_1[4]

In [None]:
# access the last member of the string
string_1[-1]

### String Slices Ex 1.13

In [None]:
#create string
string_a = "Hello World! I am Learning data wrangling"
string_a

In [None]:
# specify the slice bounds and slice the string
string_a[2:10]

In [None]:
# omit the stop value to slice through to the end of the string
string_a[-31:]

In [None]:
#using negative number for slicing
string_a[-10:-5]
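
Since the cell outputs are not shown, here is a sketch that repeats the three slices on the same string and records what each one returns. The variable names `middle`, `tail`, and `window` are introduced only for this illustration.

```python
string_a = "Hello World! I am Learning data wrangling"

middle = string_a[2:10]    # characters at indices 2 through 9
tail = string_a[-31:]      # the last 31 characters
window = string_a[-10:-5]  # starts 10 from the end, stops before 5 from the end

print(repr(middle))  # 'llo Worl'
print(repr(tail))    # 'd! I am Learning data wrangling'
print(repr(window))  # ' wran'
```

Note that spaces count as characters, which is why `window` begins with one.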

### String Functions

In [None]:
# find the length of a string with len()
len(string_a)

In [None]:
# convert string case
## use lower() and upper() methods
str_1 = "A COMPLETE UPPER CASE STRING"


In [None]:
str_1.lower()

In [None]:
str_1.upper()

In [None]:
#search for a string within a string
## use find method
str_1 = "A complicated string look like this"

In [None]:
str_1.find('complicated')

In [None]:
str_1.find('hello')

In [None]:
# to replace a string with another
### use the replace method
str_1

In [None]:
str_1.replace('complicated', 'simple')
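
A short sketch of what `find` and `replace` return, using the same string. The key points are that `find` signals a miss with `-1` rather than an error, and that `replace` returns a new string and leaves the original alone. The names `position`, `not_found`, `contains`, and `str_2` are introduced here for illustration.

```python
str_1 = "A complicated string look like this"

position = str_1.find('complicated')  # index where the match starts
not_found = str_1.find('hello')       # -1 means the substring is absent
contains = 'hello' in str_1           # use `in` when only a yes/no answer is needed

str_2 = str_1.replace('complicated', 'simple')  # str_1 is unchanged

print(position, not_found, contains)
print(str_2)
```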

### Splitting and Joining String Ex 1.14
- `split` and `join` methods
- use `str.split(separator)` to break a string into a list
- use `separator.join(list)` to combine a list of strings into one string

In [None]:
#create a string and convert it into a list
## using split
str_1 = "Name, age, Sex, Address"
list_1 = str_1.split(',')
list_1

In [None]:
# combine list into another string
s = '|'
s.join(list_1)
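
Splitting on `','` keeps the space that follows each comma, so the joined result still contains stray spaces. A sketch of one way to tidy this up with `strip` (the names `raw_fields`, `fields`, and `joined` are introduced here for illustration):

```python
str_1 = "Name, age, Sex, Address"

raw_fields = str_1.split(',')             # the leading spaces survive the split
fields = [f.strip() for f in raw_fields]  # strip() removes surrounding whitespace
joined = '|'.join(fields)

print(raw_fields)
print(joined)
```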

### Activity 1.02 Analyzing a Multi-line String and Generating the Unique Word Count

In [None]:
multiline_text= """It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.

However little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters.

"My dear Mr. Bennet," said his lady to him one day, "have you heard that Netherfield Park is let at last?"

Mr. Bennet replied that he had not.

"But it is," returned she; "for Mrs. Long has just been here, and she told me all about it."

Mr. Bennet made no answer.

"Do you not want to know who has taken it?" cried his wife impatiently.

"You want to tell me, and I have no objection to hearing it."

This was invitation enough.

"Why, my dear, you must know, Mrs. Long says that Netherfield is taken by a young man of large fortune from the north of England; that he came down on Monday in a chaise and four to see the place, and was so much delighted with it, that he agreed with Mr. Morris immediately; that he is to take possession before Michaelmas, and some of his servants are to be in the house by the end of next week."

"What is his name?"""

In [None]:
multiline_text

In [None]:
# find its type and length
type(multiline_text), len(multiline_text)

In [None]:
# remove all symbols using replace; newlines become spaces so the words on either side stay separate
multiline = multiline_text.replace('\n',' ').replace('?','').replace('.','').replace(';','').replace(',','').replace('"',' ')
multiline

In [None]:
# split on any run of whitespace so that no empty strings end up in the list
list_word = multiline.split()
list_word

In [None]:
unique_lst = list(set(list_word))
unique_lst

In [None]:
unique_dict = dict.fromkeys(list_word)
unique_dict

In [None]:
for x in list_word:
    if unique_dict[x] is None:
        unique_dict[x] = 1
    else:
        unique_dict[x] += 1
unique_dict        

In [None]:
top_words = sorted(unique_dict.items(), key=lambda x: x[1], reverse=True)
top_words[:25]
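
The counting loop above can also be written with `collections.Counter` from the standard library. This is an alternative sketch, not the workshop's solution: it uses only the first sentence of the passage, and lowercasing the text and extracting words with a regular expression are my own choices, so the counts can differ from the case-sensitive version above.

```python
import re
from collections import Counter

text = ("It is a truth universally acknowledged, that a single man in "
        "possession of a good fortune, must be in want of a wife.")

# Lowercase first so that "It" and "it" are counted together,
# then pull out runs of letters and apostrophes as words.
words = re.findall(r"[a-z']+", text.lower())

counts = Counter(words)
top = counts.most_common(3)  # the three most frequent words with their counts

print(len(words), len(counts))
print(top)
```

`Counter` handles the "first time seen" case internally, so the `None` check in the manual loop is no longer needed.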
