## Data in wrong format

In [1]:
import pandas as pd
import numpy as np

In [4]:
df = pd.read_csv('Financials_Sample_Data.csv', header=0)

In [5]:
df.head()

Unnamed: 0,Account,Businees Unit,Currency,Year,Scenario,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec
0,Sales,Software,USD,2012,Actuals,"$90,924,002","$82,606,134","$72,780,220","$52,943,701","$77,528,109","$96,384,524","$77,345,061","$98,290,873","$79,879,127","$95,373,403","$54,887,908","$82,703,597"
1,Cost of Goods Sold,Software,USD,2012,Actuals,"($41,623,278)","($40,464,347)","($30,806,326)","($21,412,962)","($37,047,252)","($44,819,597)","($34,847,393)","($47,903,350)","($35,880,653)","($44,982,115)","($26,929,424)","($34,233,473)"
2,Commissions Expense,Software,USD,2012,Actuals,"($4,454,359)","($3,386,032)","($3,389,705)","($2,149,257)","($3,168,079)","($4,417,624)","($3,386,461)","($4,052,846)","($3,418,737)","($4,365,527)","($2,455,561)","($3,646,726)"
3,Payroll Expense,Software,USD,2012,Actuals,"($9,901,680)","($9,871,172)","($8,459,696)","($6,303,408)","($8,493,573)","($11,082,494)","($8,081,033)","($11,070,018)","($8,410,665)","($10,081,727)","($6,300,578)","($9,099,438)"
4,Travel & Entertainment Expense,Software,USD,2012,Actuals,"($951,255)","($838,985)","($872,700)","($624,416)","($919,835)","($1,085,296)","($818,602)","($1,040,585)","($803,190)","($1,158,623)","($611,335)","($941,542)"


In [6]:
df.describe()

Unnamed: 0,Year
count,351.0
mean,2017.923077
std,3.631184
min,2012.0
25%,2015.0
50%,2018.0
75%,2021.0
max,2023.0


When we look at the output of **'df.describe()'**, only the Year column is summarised: the monthly columns are missing and no statistics are calculated for them. By default, describe() only summarises numeric columns, so the monthly columns are probably not stored as numbers.
- To check the data type of every column, we will use **print(df.dtypes)**

In [7]:
print(df.dtypes)

Account          object
Businees Unit    object
Currency         object
Year              int64
Scenario         object
Jan              object
Feb              object
Mar              object
Apr              object
May              object
Jun              object
Jul              object
Aug              object
Sep              object
Oct              object
Nov              object
Dec              object
dtype: object
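
The behaviour seen above can be reproduced on a tiny frame. This is a minimal sketch: the `sample` DataFrame and its values are made up for illustration and only mimic the shape of the real file.

```python
import pandas as pd

# Hypothetical miniature of the financial table: one true integer column
# and one column of currency-formatted strings.
sample = pd.DataFrame({
    "Year": [2012, 2013, 2014],
    "Jan": ["$90,924,002", "($41,623,278)", "($4,454,359)"],
})

# By default describe() summarises numeric columns only,
# so the string column 'Jan' is left out.
numeric_summary = sample.describe()
print(numeric_summary.columns.tolist())

# include='all' also reports non-numeric columns (count, unique, top, freq),
# which makes it obvious that they exist but are not numbers.
full_summary = sample.describe(include="all")
print(full_summary.columns.tolist())
```

Passing `include='all'` does not fix anything, but it is a quick way to confirm that the columns were loaded and are simply being treated as text.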


### Now we need to convert these columns to the required data type

- One way of doing this is **df['column name'] = pd.to_numeric(df['column name'])**
- Another way is **df['column name'] = df['column name'].astype('int')**
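
As a rough illustration of both options, the sketch below applies them to a small, made-up Series of clean numeric strings, and then to strings formatted like the ones in this file. The variable names are assumptions chosen for the example.

```python
import pandas as pd

# Clean numeric strings convert without any preparation.
clean = pd.Series(["100", "250", "-75"])
print(pd.to_numeric(clean).tolist())
print(clean.astype("int").tolist())

# Currency-formatted strings do not: to_numeric raises a ValueError.
formatted = pd.Series(["$90,924,002", "($41,623,278)"])
conversion_failed = False
try:
    pd.to_numeric(formatted)
except ValueError as exc:
    conversion_failed = True
    print("to_numeric failed:", exc)

# errors='coerce' avoids the exception but turns every unparseable
# value into NaN, which silently loses the data.
coerced = pd.to_numeric(formatted, errors="coerce")
print(coerced.isna().all())
```

Because `errors='coerce'` would replace every value here with NaN, cleaning the strings first is the safer route.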

### This should work fine on other data, but not on the data used here because:
- The columns that need conversion also contain non-numeric characters (dollar signs, commas and brackets), so we first need to remove those characters using the methods below:

In [10]:
#Remove dollar signs ($) from the entire DataFrame
#A raw string (r'...') keeps the backslash in the pattern as a regex escape
df.replace(to_replace=r'\$', value='', regex=True, inplace=True)

In [12]:
#DATA CLEANING
#Clean the data to get rid of the brackets around each number
#The same pattern can be adapted to strip other wrapping characters

# Specify the columns where you want to remove brackets
columns_to_clean = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

# Remove brackets around numbers in the specified columns
# Note: in accounting notation brackets usually mark negative amounts, so this step drops the sign
for col in columns_to_clean:
    df[col] = df[col].replace(to_replace=r'\(([\d,]+)\)', value=r'\1', regex=True)

# Save the modified DataFrame back to a CSV file
df.to_csv('Modified_Employee_Data.csv', index=False, encoding='utf-8')
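
In accounting notation, bracketed amounts usually denote negative values, so stripping the brackets turns expenses into positive numbers. If the sign matters for later analysis, one possible variation is to replace the brackets with a leading minus sign. The sketch below uses a small, made-up Series (`amounts`) rather than the real file, and is an alternative to the cell above, not the method used in this notebook.

```python
import pandas as pd

# Hypothetical values, already stripped of '$'.
amounts = pd.Series(["90,924,002", "(41,623,278)", "(951,255)"])

# Turn "(1,234)" into "-1,234" instead of "1,234" so the sign survives.
signed = amounts.replace(to_replace=r"\(([\d,]+)\)", value=r"-\1", regex=True)

# Drop the thousands separators, then convert to integers.
signed = signed.str.replace(",", "", regex=False).astype(int)
print(signed.tolist())  # [90924002, -41623278, -951255]
```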


In [14]:
#DATA CLEANING
#Remove the spaces and commas inside the numbers so that Python can parse them

# Remove spaces and commas from numbers in the specified columns
for col in columns_to_clean:
    df[col] = df[col].replace(to_replace=[',', ' '], value='', regex=True)

# Convert the columns to integers
df[columns_to_clean] = df[columns_to_clean].astype(int)

# Save the modified DataFrame back to a CSV file
df.to_csv('Modified_Employee_Data.csv', index=False, encoding='utf-8')
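
A quick way to confirm that the conversion worked is to check the dtypes again and re-run describe(). The sketch below does this on a small, made-up frame (`check`) with two month columns; on the real data the same checks would be run against `df` and `columns_to_clean`.

```python
import pandas as pd

# Hypothetical cleaned strings standing in for the real month columns.
month_cols = ["Jan", "Feb"]
check = pd.DataFrame({
    "Year": [2012, 2012],
    "Jan": ["90924002", "41623278"],
    "Feb": ["82606134", "40464347"],
})
check[month_cols] = check[month_cols].astype(int)

# Every month column should now report an integer dtype.
all_integer = all(pd.api.types.is_integer_dtype(check[c]) for c in month_cols)
print(all_integer)

# describe() should now include the month columns alongside Year.
summary_cols = check.describe().columns.tolist()
print(summary_cols)
```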
