## Pandas part - 2

**In this part we are going to learn about:**

- StringIO
- Pandas read_csv



```python
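from io import StringIO

import pandas as pd

data = '''name,age,city
Alice,25,New York
Bob,30,London
Charlie,35,Sydney'''

# pd.compat.StringIO is not available in current pandas releases;
# io.StringIO wraps the string in a file-like object that read_csv can parse
df = pd.read_csv(StringIO(data))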

print(df)
```

**Explanation:**

The given code uses `StringIO` and `pd.read_csv` to create a DataFrame (`df`) from a CSV-like string (`data`). It reads the string as a CSV file and creates a DataFrame with three columns: 'name', 'age', and 'city'.

**Optimization:**

If the values are already available in Python rather than as CSV text, we can skip `StringIO` and `pd.read_csv` and build the DataFrame directly with `pd.DataFrame` from a dictionary that maps column names to lists of values.

**Optimized Code:**

```python
import pandas as pd

data = {
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'city': ['New York', 'London', 'Sydney']
}

df = pd.DataFrame(data)

print(df)
```

**Explanation:**

In the optimized code, we pass a dictionary of columns directly to `pd.DataFrame`, so no CSV parsing is needed and `StringIO` and `pd.read_csv` drop out. This keeps the code short and simple, in the spirit of KISS (Keep It Simple, Stupid). It only applies when the values already exist as Python objects; when the data arrives as CSV text or lives in a file, `pd.read_csv` is still the right tool, and `StringIO` is a convenient way to try out its options without creating a file.
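As a quick check that both routes produce the same table, the sketch below builds the DataFrame each way and compares the contents. The names `csv_text`, `from_csv`, and `from_dict` are illustrative and do not appear in the original notebook.

```python
from io import StringIO

import pandas as pd

csv_text = '''name,age,city
Alice,25,New York
Bob,30,London
Charlie,35,Sydney'''

# Route 1: parse CSV text through an in-memory file-like object
from_csv = pd.read_csv(StringIO(csv_text))

# Route 2: build the same table from a dictionary of columns
from_dict = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'city': ['New York', 'London', 'Sydney'],
})

# Compare the two tables column by column as plain Python lists
same = from_csv.to_dict('list') == from_dict.to_dict('list')
print(same)
```

The comparison goes through `to_dict('list')` so that it looks only at column names and values, not at dtype details.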

In [1]:
# Reading different data sources with the help of pandas

from io import StringIO

In [5]:
import pandas as pd
df=pd.read_csv('tvmarketing.csv')
df.head()

Unnamed: 0,TV,Sales
0,230.1,22.1
1,44.5,10.4
2,17.2,9.3
3,151.5,18.5
4,180.8,12.9


In [6]:
type(df)

pandas.core.frame.DataFrame

In [7]:
data=('col1,col2,col3\n'
     'x,y,1\n'
     'a,b,2\n'
     'c,d,3\n'
     'e,f,4')
data

'col1,col2,col3\nx,y,1\na,b,2\nc,d,3\ne,f,4'

In [8]:
type(data)

str

In [9]:
## in-memory file-like object
StringIO(data)

<_io.StringIO at 0x225c7703e20>
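To make "file-like" concrete, here is a minimal sketch showing that a `StringIO` object supports the same reading methods as an open text file, which is why `pd.read_csv` accepts it. The names `text`, `buffer`, `header`, and `rest` are illustrative.

```python
from io import StringIO

text = 'col1,col2,col3\nx,y,1\na,b,2'

buffer = StringIO(text)      # wrap the string in a file-like object
header = buffer.readline()   # read one line, just like an open file
rest = buffer.read()         # read whatever remains after the first line

print(repr(header))
print(repr(rest))
```

Like a real file, the buffer keeps a read position, so a second `read()` returns an empty string unless you call `buffer.seek(0)` first.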

In [10]:
pd.read_csv(StringIO(data))

Unnamed: 0,col1,col2,col3
0,x,y,1
1,a,b,2
2,c,d,3
3,e,f,4


In [11]:
pd.read_csv(StringIO(data),usecols=['col1','col2'])

Unnamed: 0,col1,col2
0,x,y
1,a,b
2,c,d
3,e,f


In [12]:
import pandas as pd
df=pd.read_csv('mercedesbenz.csv',usecols=['X0','X1','X2','X3','X4','X5'])
df.head()

Unnamed: 0,X0,X1,X2,X3,X4,X5
0,k,v,at,a,d,u
1,k,t,av,e,d,y
2,az,w,n,c,d,x
3,az,t,n,f,d,x
4,az,v,n,f,d,h


In [13]:
# Save the DataFrame as a CSV file; the file is written to the current working directory

df.to_csv('test.csv',index=False)
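The `index=False` argument matters when the file is read back later. The sketch below, which uses a small made-up DataFrame and keeps everything in memory instead of writing `test.csv`, compares the columns you get back with and without the index.

```python
from io import StringIO

import pandas as pd

# Small illustrative frame (not the Mercedes-Benz data)
df = pd.DataFrame({'X0': ['k', 'az'], 'X1': ['v', 't']})

# to_csv returns the CSV text when no path is given
with_index = df.to_csv()
without_index = df.to_csv(index=False)

# Reading the text back shows the effect of writing the index
cols_with = list(pd.read_csv(StringIO(with_index)).columns)
cols_without = list(pd.read_csv(StringIO(without_index)).columns)

print(cols_with)
print(cols_without)
```

When the index is written, its header cell is empty, so `read_csv` names that column `Unnamed: 0`. Passing `index=False` avoids the extra column.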

In [16]:
## data types in CSV: another example, this time with a missing (NaN) value
data = ('a,b,c,d\n'
            '1,2,3,4\n'
            '5,6,7,8\n'
            '9,10,11')

In [20]:
df=pd.read_csv(StringIO(data))
df

Unnamed: 0,a,b,c,d
0,1,2,3,4.0
1,5,6,7,8.0
2,9,10,11,


In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   a       3 non-null      int64  
 1   b       3 non-null      int64  
 2   c       3 non-null      int64  
 3   d       2 non-null      float64
dtypes: float64(1), int64(3)
memory usage: 228.0 bytes
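Column `d` becomes `float64` because the default integer dtype cannot hold a missing value. If you would rather keep whole numbers, one option is pandas' nullable integer dtype `'Int64'`. This is a sketch of that alternative, not something the original notebook does.

```python
from io import StringIO

import pandas as pd

data = ('a,b,c,d\n'
        '1,2,3,4\n'
        '5,6,7,8\n'
        '9,10,11')

# Default inference: the missing value forces column d to float64
default_df = pd.read_csv(StringIO(data))

# Nullable integer dtype: values stay integers, the gap becomes <NA>
nullable_df = pd.read_csv(StringIO(data), dtype={'d': 'Int64'})

print(default_df['d'].dtype)
print(nullable_df['d'].dtype)
print(nullable_df['d'].isna().sum())
```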


In [21]:
df=pd.read_csv(StringIO(data),dtype='object')

In [22]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   a       3 non-null      object
 1   b       3 non-null      object
 2   c       3 non-null      object
 3   d       2 non-null      object
dtypes: object(4)
memory usage: 228.0+ bytes


In [23]:
df.head()

Unnamed: 0,a,b,c,d
0,1,2,3,4
1,5,6,7,8
2,9,10,11,


In [24]:
df.isnull()

Unnamed: 0,a,b,c,d
0,False,False,False,False
1,False,False,False,False
2,False,False,False,True


In [28]:
df.isnull().sum()

a    0
b    0
c    0
d    1
dtype: int64

In [29]:
df['a'][0]

'1'

In [33]:
# data types in CSV: setting the dtype per column

data = ('a,b,c,d\n'
            '1,2,3,4\n'
            '5,6,7,8\n'
            '9,10,11')
data

'a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11'

In [35]:
df = pd.read_csv(StringIO(data))

In [36]:
df

Unnamed: 0,a,b,c,d
0,1,2,3,4.0
1,5,6,7,8.0
2,9,10,11,


In [37]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   a       3 non-null      int64  
 1   b       3 non-null      int64  
 2   c       3 non-null      int64  
 3   d       2 non-null      float64
dtypes: float64(1), int64(3)
memory usage: 228.0 bytes


In [38]:
df = pd.read_csv(StringIO(data),dtype={'a':int,'b':float,'c':int})

In [40]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   a       3 non-null      int32  
 1   b       3 non-null      float64
 2   c       3 non-null      int32  
 3   d       2 non-null      float64
dtypes: float64(2), int32(2)
memory usage: 204.0 bytes


In [41]:
df.dtypes

a      int32
b    float64
c      int32
d    float64
dtype: object

In [47]:
data = ('index,a,b,c\n'
           '4,apple,bat,5.7\n'
            '8,orange,cow,10')

In [48]:
df = pd.read_csv(StringIO(data))

In [49]:
df

Unnamed: 0,index,a,b,c
0,4,apple,bat,5.7
1,8,orange,cow,10.0
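Read this way, the `index` column is treated as ordinary data and pandas adds its own 0, 1 row labels. A minimal sketch of using that column as the row index instead, via the `index_col` parameter of `read_csv`:

```python
from io import StringIO

import pandas as pd

data = ('index,a,b,c\n'
        '4,apple,bat,5.7\n'
        '8,orange,cow,10')

# Use the 'index' column as the row labels instead of a data column
df = pd.read_csv(StringIO(data), index_col='index')

print(df.index.tolist())
print(df.loc[4, 'a'])   # look up a row by its label from the file
```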


In [57]:


data = '''name,age,city
Alice,25,New York
Bob,30,London
Charlie,35,Sydney'''


data

'name,age,city\nAlice,25,New York\nBob,30,London\nCharlie,35,Sydney'

In [56]:
# Correct column names in usecols and set the first column as the index
df = pd.read_csv(StringIO(data), usecols=['name', 'age', 'city'], index_col='name')

print(df)

         age      city
name                  
Alice     25  New York
Bob       30    London
Charlie   35    Sydney


In [60]:

# Creating the dummy dataset
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 28],
    'City': ['New York', 'London', 'Sydney', 'Los Angeles'],
    'Salary': [50000, 60000, 75000, 45000]
}

# Creating a DataFrame from the data
df = pd.DataFrame(data)

# Saving the DataFrame to a tab-separated file
df.to_csv('dummy_dataset.tsv', sep='\t', index=False)
df

Unnamed: 0,Name,Age,City,Salary
0,Alice,25,New York,50000
1,Bob,30,London,60000
2,Charlie,35,Sydney,75000
3,David,28,Los Angeles,45000
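To read a tab-separated file back, pass the same separator to `read_csv`. The sketch below does the round trip through an in-memory buffer rather than `dummy_dataset.tsv`, so it runs without touching the disk. The names `buffer` and `restored` are illustrative.

```python
from io import StringIO

import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 28],
    'City': ['New York', 'London', 'Sydney', 'Los Angeles'],
    'Salary': [50000, 60000, 75000, 45000],
})

# Write tab-separated text into an in-memory buffer
buffer = StringIO()
df.to_csv(buffer, sep='\t', index=False)

# Rewind the buffer, then parse it with the matching separator
buffer.seek(0)
restored = pd.read_csv(buffer, sep='\t')

print(restored.shape)
print(list(restored.columns))
```

Without `sep='\t'`, `read_csv` would look for commas and load each line as a single column.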
